UNCLASSIFIED 

AD 437324


DEFENSE  DOCUMENTATION  CENTER 

FOR 

SCIENTIFIC  AND  TECHNICAL  INFORMATION 

CAMERON  STATION.  ALEXANDRIA.  VIRGINIA 


Reproduced  From 
Best  Available  Copy 


NOTICE: When government or other drawings, specifications or other
data are used for any purpose other than in connection with a
definitely related government procurement operation, the U. S.
Government thereby incurs no responsibility, nor any obligation
whatsoever; and the fact that the Government may have formulated,
furnished, or in any way supplied the said drawings, specifications,
or other data is not to be regarded by implication or otherwise as in
any manner licensing the holder or any other person or corporation,
or conveying any rights or permission to manufacture, use or sell any
patented invention that may in any way be related thereto.



AFCRL  -  64  -  86 


MULTIDIMENSIONAL  MODEL  FOR  AUTOMATIC  SPEECH  RECOGNITION 

B. V. BHIMANI

BHIMANI RESEARCH ASSOCIATES
1838 Massachusetts Avenue
Lexington 73, Massachusetts


FINAL REPORT

Contract No. AF 19(628)-2766
Project 4610
Task 461002

February 14, 1964


Prepared for:


AIR FORCE CAMBRIDGE RESEARCH LABORATORIES
OFFICE OF AEROSPACE RESEARCH
UNITED STATES AIR FORCE
BEDFORD, MASSACHUSETTS


Requests for additional copies by Agencies of the Department of
Defense, their contractors, and other Government agencies should
be directed to the:

DEFENSE DOCUMENTATION CENTER (DDC)
CAMERON STATION
ALEXANDRIA, VIRGINIA

Department of Defense contractors must be established for DDC
services or have their 'need-to-know' certified by the cognizant
military agency of their project or contract.

All other persons and organizations should apply to the:

U. S. DEPARTMENT OF COMMERCE
OFFICE OF TECHNICAL SERVICES
WASHINGTON 25, D. C.


AFCRL  -  64  -  85 


MULTIDIMENSIONAL  MODEL  FOR  AUTOMATIC  SPEECH  RECOGNITION 

B.  V.  BHIMANI 

BHIMANI  RESEARCH  ASSOCIATES 
1838  Massachusetts  Avenue 
Lexington  73,  Massachusetts 


FINAL  REPORT 

Contract No. AF 19(628)-2766
Project  4610 
Task  461002 

February  14,  1964 


Prepared for:


AIR FORCE CAMBRIDGE RESEARCH LABORATORIES
OFFICE  OF  AEROSPACE  RESEARCH 
UNITED  STATES  AIR  FORCE 
BEDFORD,  MASSACHUSETTS 


ABSTRACT 


The purpose of this study is to provide a theoretical basis for a
general purpose speech recognizer. The research has focused upon the
nature of normal speech, which can be distinguished from discrete
articulation by the continuous movement (in normal speech) of articulators
from one position to another; as a result, sounds in continuous speech
are more likely to modify the production of surrounding sounds than
they are in discrete speech.

Assuming that, according to the ergodic theory, sound changes
occurring in everyday speech reflect and repeat the changes which have
occurred in the historical development of language (because the physical
modes of speech production are the same), linguistic examples and
theories of sound change were studied. From this study, a body of rules
for sound change or euphonic combination was derived and their applicability
to the English language tested. These rules represent an error-correcting
code to restore omitted or indefinite word boundaries and/or to restore the
orthographic phone classes which are altered in continuous speech.

The study required the evaluation of existing research and theories,
as well as the generation of some original data, the latter consisting of
high-quality recordings of continuous speech samples. Both original data
and previously published data were subjected to acoustic analysis of
minute portions of the speech waveform. These measurements both
suggested and justified a principle of segmenting speech, to be used in
conjunction with the representations of speech sounds in the multidimensional
model (according to the degree of freedom in various dimensions of their
production), and the above mentioned error-correcting code, to delineate
a new conception of a general purpose recognizer.


TABLE  OF  CONTENTS 

SUBJECT TITLE  .....  PAGE NUMBER

PREFACE  .....  i

SECTION 1: LINGUISTIC ASPECTS OF SPEECH: GENERAL PROBLEMS OF
           AUTOMATIC SPEECH RECOGNITION  .....  1

    I.   HISTORY AND DESCRIPTION OF PHONEMIC THEORY  .....  2
    II.  TECHNIQUES OF PHONEMIC ANALYSIS  .....  7

SECTION 2: GENERAL DISCUSSION OF THE MULTIDIMENSIONAL MODEL

    INTRODUCTION  .....  11
    I.   DISCUSSION OF THE DIMENSIONS  .....  12
         A. MANNER OF ARTICULATION, PLACE OF ARTICULATION,
            AND RESONANCE  .....  12
    II.  VOWELS  .....  20
         A. SEGMENTATION  .....  21
         B. NON-DISTINCTIVE CONSONANT DIFFERENCES DEPENDING ON
            FOLLOWING VOWEL  .....  21
    III. DURATION  .....  24
         A. SOME CONSIDERATIONS ON THE NORMALIZATION OF
            DURATION  .....  25
         B. THE IMPORTANCE OF DURATION MEASUREMENTS TO SPEECH
            RECOGNITION  .....  27
         C. METHODS OF INDICATING DURATION IN OUR MODEL  .....  32
    IV.  INTENSITY  .....  33
    V.   FUNDAMENTAL FREQUENCY  .....  37
         A. THE CORRELATION BETWEEN FUNDAMENTAL FREQUENCY
            LEVELS AND FORMANT LEVELS  .....  37
         B. HIGH FUNDAMENTAL FREQUENCY AND THE ACCURACY OF
            FORMANT MEASUREMENTS  .....  38
         C. PITCH AND FUNDAMENTAL FREQUENCY  .....  38
         D. PITCH AS USED IN SPEECH  .....  38
         E. FUNDAMENTAL FREQUENCY AND ACCENT  .....  38

SECTION 3: SOUND CHANGE AND THE MULTIDIMENSIONAL MODEL

    INTRODUCTION  .....  40
    I.   THE NATURE OF SOUND CHANGE  .....  41
    II.  THE CAUSES OF SOUND CHANGE  .....  46
    III. SOUND CHANGES CONSIDERED IN TERMS OF THE DIMENSIONS
         OF OUR MODEL  .....  50
         A. CHANGES IN MANNER OF ARTICULATION  .....  51
         B. CHANGES IN PLACE OF ARTICULATION  .....  51
         C. RESONANCES  .....  51
         D. HOW SOUNDS DROP OUT  .....  52
         E. DURATION  .....  52
         F. INTENSITY  .....  53
         G. FUNDAMENTAL FREQUENCY  .....  53
         H. GLOTTAL ADJUSTMENTS  .....  54
    IV.  SOUND CHANGES INVOLVING "PROBLEM" PHONES  .....  54
         A. CHANGES INVOLVING PHONES WITH A SINGLE PLACE OF
            ARTICULATION  .....  54
         B. CHANGES INVOLVING PHONES WITH TWO PLACES OF
            ARTICULATION  .....  55
    V.   THE RELEVANCE OF SANDHI RULES OF SANSKRIT TO OUR
         MODEL  .....  55
    VI.  REPRESENTATION OF RULES FOR EUPHONIC COMBINATION  .....  59

SECTION 4: ACOUSTIC CONSIDERATIONS OF SPEECH

    INTRODUCTION  .....  61
    I.   BACKGROUND OF ACOUSTIC WORK  .....  63
    II.  COARTICULATION AND DURATION  .....  64
         A. STUDIES IN THE IMPORTANCE OF COARTICULATION  .....  65
         B. THE ACOUSTIC EFFECT OF CHANGES IN DURATION  .....  70
    III. THE APPLICATION OF AVAILABLE ACOUSTIC DATA TO THE
         NEEDS OF A GENERAL PURPOSE RECOGNIZER  .....  72
    IV.  APPLICATION OF THE RULES OF EUPHONIC COMBINATION TO
         CONTINUOUS SPEECH  .....  74
         A. DISCUSSION OF SPEECH DATA  .....  80
         B. DATA ON "PROBLEM" PHONE CLASSES  .....  81
         C. VERIFICATION OF COARTICULATION AND EUPHONIC
            COMBINATION  .....  87
    V.   FURTHER MEASUREMENTS WHICH INDICATE THE IMPORTANCE OF
         DURATION AND INTENSITY, AND WHICH SUBSTANTIATE OUR
         APPROACH  .....  90

SECTION 5: SEGMENTATION AND CONSIDERATIONS FOR COMPUTER OPERATIONS

    INTRODUCTION  .....  94
    I.   SEGMENTATION  .....  94
         A. CONSONANT CLUSTERS  .....  96
         B. REFINEMENT OF THE CONCEPT OF COARTICULATION  .....  97
    II.  INFORMATION ON THE OCCURRENCE OF RULES  .....  98
    III. OUTLINE OF APPROACHES TO THE COMPUTER PROGRAM  .....  99

SECTION 6: CONCLUSIONS  .....  101

APPENDIX A  PHONETIC ANALYSIS OF VIETNAMESE  .....  105
APPENDIX B  CHARTS OF THE CONSONANT CATEGORIES  .....  110
APPENDIX C  A NOTE ON PALATOGRAMS  .....  117
APPENDIX D  REVIEW OF MEYER'S WORK ON DURATION  .....  118
APPENDIX E  REVIEW OF OTHER DURATION STUDIES  .....  121
APPENDIX F  METHODS OF LINGUISTIC RECONSTRUCTION  .....  128
APPENDIX G  REVIEW OF MARTINET'S THEORIES  .....  131
APPENDIX H  RULES OF SOUND CHANGE AND EUPHONIC COMBINATION  .....  143
    H.I     CHANGES IN PLACE OF ARTICULATION  .....  143
    H.II    CHANGES IN MANNER OF ARTICULATION  .....  147
    H.III   RESONANCES  .....  153
    H.IV    SOUND DROP-OUTS  .....  155
    H.V     A. CHANGES INVOLVING PHONES WITH A SINGLE PLACE OF
               ARTICULATION  .....  160
            B. CHANGES INVOLVING PHONES WITH TWO PLACES OF
               ARTICULATION  .....  165
    H.VI    SANDHI RULES OF SANSKRIT AND THEIR APPLICABILITY TO
            ENGLISH  .....  169
    H.VII   METHOD AND RULES FOR REPRESENTING EUPHONIC RULES
            SYMBOLICALLY  .....  173
    H.VIII  RULES OF SOUND SHIFT DERIVED FROM MARTINET'S
            THEORY  .....  185
    H.IX    FURTHER RULES OF SOUND CHANGE  .....  186
APPENDIX I  VISARGA VOWELS  .....  189
APPENDIX J  THE IMPORTANCE OF THE VOCODER IN ACOUSTIC AND
            PHONETIC RESEARCH  .....  192
APPENDIX K  THE DEVELOPMENT OF MACHINES FOR SPEECH
            PERCEPTION  .....  200
APPENDIX L  RELATION OF AVAILABLE INFORMATION TO ACOUSTIC
            CORRELATES OF SPEECH  .....  204
APPENDIX M  MEASUREMENTS MADE FOR DETERMINING ACOUSTIC
            CHARACTERISTICS OF SPEECH  .....  207
APPENDIX N  A NOTE ON THE MEASUREMENTS PERFORMED ON THE
            LEHISTE-PETERSON DATA  .....  208
APPENDIX O  RULES OF EUPHONIC COMBINATION SUPPORTED BY
            ACOUSTIC MEASUREMENTS  .....  209
APPENDIX P  RULES OF EUPHONIC COMBINATION AND PHONETIC
            TRANSCRIPTION  .....  213
APPENDIX Q  DISCUSSION OF THE W PHONE CLASS  .....  222

BIBLIOGRAPHY  .....  223


LIST  OF  FIGURES 


NUMBER  TITLE  .....  PAGE NUMBER

 1  Time-Amplitude Waveform for h  .....  14a
 2  Time-Amplitude Waveform for f  .....  14a
 3  Time-Amplitude Waveform for m  .....  14a
 4  Palatogram of th  .....  14a
 5  Palatogram of s, z  .....  14a
 6  Time-Amplitude Waveform of sh  .....  14a
 7  Time-Amplitude Waveform of ch  .....  14a
 8  Diagram of onglide, steady-state, and offglide portions  .....  24a
 9  Spectrogram of Southern, English, and General American
    Speech  .....  24b
10  Diagram of Spectral Areas of "pit"  .....  32a
11  Chart Comparing the Duration of portions of "bite," in a
    Southern, British, and General American Pronunciation  .....  32a
12  Time-Amplitude Waveform Illustrating a Method for Intensity
    Measurement and Definition  .....  32a
13  Representation of Speech in Phonetic Symbols  .....  36a
14  Representation of Sounds According to a Center of Gravity
    Hypothesis  .....  42a
15  Grouping Sounds According to Moveable Planes of
    Articulation  .....  42a
16  Diagram of Tongue Position for Articulation of t in
    "tick"  .....  42a
17  Chart of Articulation of American Consonant Phonemes  .....  48a
18  Survey of Speech Recognition Activity  .....  64a, b
19  Application of the Locus Theory to Natural Speech  .....  66a
20  Application of the Locus Theory to Three Vowels in Natural
    Speech  .....  66b
21  Spectrogram of "We have no wax" by Speaker 1  .....  82a
22  Spectrogram of "We have no ax" by Speaker 1  .....  82a
23  Spectrogram of "We have no wax" by Speaker 2  .....  82a
24  Spectrogram of ". . . have no ax" by Speaker 2  .....  82a
25  Spectrogram of "We have no wax" by Speaker 5  .....  82a
26  Spectrogram of "We have no ax" by Speaker 5  .....  82a
27  Spectrogram of "No animal has three ears" by Speaker 1  .....  82a
28  Spectrogram of "It lasted three years" by Speaker 1  .....  82a
29  Spectrogram of "No animal has three ears" by Speaker 5  .....  82a
30  Spectrogram of "It lasted three years" by Speaker 5  .....  82a
31  Spectrogram of "He took the small kitten home with him"
    by Speaker 1  .....  84a
32  Time-Amplitude Plot of Syllabic n in "He took the small
    kitten home with him."  .....  84a
33  Time-Amplitude Plot of Syllabic n in an Articulation of
    "moon"  .....  84a
34  Spectrogram of "will" by Speaker  .....  88a
35  Spectrogram of "Will you help us" by Speaker 1  .....  88a
36  Spectrogram of ". . . will give himself some pains to
    observe . . ." by Speaker 1  .....  88a
37  Spectrogram of "will" by Speaker 2  .....  88a
38  Spectrogram of "Will you help us?" by Speaker 2  .....  88a
39  Spectrogram of ". . . er will give himself some pains
    t . . ." by Speaker 2  .....  88a
40  Spectrogram of "will" by Speaker 5  .....  88a
41  Spectrogram of "Will you help us?" by Speaker 5  .....  88a
42  Spectrogram of ". . . er will give himself some pains to
    observe," by Speaker 5  .....  88a
43  Time-Amplitude Plot of tla of "it lasted" from "It lasted
    three years." by Speaker 5  .....  90a
44  Time-Amplitude Plot of tB of "it lasted" from "It lasted
    three years" by Speaker 5  .....  90a
45  Time-Amplitude Plot of kth of "took the" from "He took the
    small kitten home with him." by Speaker 5  .....  90a
46  Time-Amplitude Plot of a "voiceless nasal" between s and m
    in small  .....  90a
47  Spectrogram of "different operations" by Speaker 1  .....  90b
48  Spectrogram of "different operations" by Speaker 2  .....  90b
49  Spectrogram of "different operations" by Speaker 5  .....  90b
50  Spectrogram of "observation of" by Speaker 1  .....  90b
51  Spectrogram of "observation of" by Speaker 2  .....  90b
52  Spectrogram of "observation of" by Speaker 5  .....  90b
53  Time-Amplitude Plot of "observe" by Speaker 1  .....  90e
54  Time-Amplitude Plot of "observation" by Speaker 1  .....  90e
55  Time-Amplitude Plot of "observe" by Speaker 2  .....  90e
56  Time-Amplitude Plot of "observation" by Speaker 2  .....  90e
57  Spectrogram of "some ancient sage." by Speaker 1  .....  92a
58  Spectrogram of "some ancient sage" by Speaker 2  .....  92a
59  Spectrogram of "soundness or rottenness" by Speaker 1  .....  92a
60  Spectrogram of "soundness or rottenness" by Speaker 2  .....  92a
61  Spectrogram of "soundness or rottenness" by Speaker 5  .....  92a
62  Spectrogram of "the human mind" by Speaker 1  .....  92a
63  Spectrogram of "the human mind" by Speaker 2  .....  92a
64  Spectrogram of "the human mind" by Speaker 5  .....  92b
65  Spectrogram of "Mrs. Slipslop" by Speaker 1  .....  92b
66  Spectrogram of "Mrs. Slipslop" by Speaker 2  .....  92b
67  Spectrogram of "Mrs. Slipslop" by Speaker 5  .....  92b
68  Time-Amplitude Plot of "which" by Speaker 1  .....  92b
69  Spectrogram of "which" by Speaker 1  .....  92b
70  Spectrogram of "which" by Speaker 2  .....  92b
71  Spectrogram of "which" by Speaker 5  .....  92b
72  Diagram of Speech Production and Perception Process  .....  100a
73  Time-Amplitude Plot of £-c from "three years" by
    Speaker 5  .....  206a
74  Time-Amplitude Plot of e-y-e from "three years" by
    Speaker 5  .....  206a
75  Diagram of Air-Flow for Visarga Vowels  .....  190a
76  Diagram of Frequency Characteristics of Visarga and Normal
    Vowels  .....  190b

LIST  OF  TABLES 

TABLE  TITLE  .....  PAGE NUMBER

 1  Comparison of Duration Changes for "no ax" - "no wax"  .....  82c
 2  Comparison of Frequency Changes for "no ax" - "no wax"  .....  82c
 3  Comparison of Duration Changes for "three ears" -
    "three years"  .....  82d
 4  Comparison of Frequency Changes for "three ears" -
    "three years"  .....  82d
 5  Duration in Milliseconds for the word "will" in three
    different environments  .....  88b
36  Table of ". . . er will give himself some pains to
    observe . . ." by Speaker 1  .....  88b
39  Table of ". . . er will give himself some pains t . . ."
    by Speaker 2  .....  88c
40  Table of "will" by Speaker 5  .....  88c
41  Table of "Will you help us?" by Speaker 5  .....  88c
42  Table of ". . . er will give himself some pains to
    observe," by Speaker 5  .....  88d
47  Table of "different operations" by Speaker 1  .....  90c
48  Table of "different operations" by Speaker 2  .....  90c
49  Table of "different operations" by Speaker 5  .....  90c
50  Table of "observation of" by Speaker 1  .....  90d
52  Table of "observation of" by Speaker 5  .....  90d
57  Table of "some ancient sage." by Speaker 1  .....  92c
58  Table of "some ancient sage" by Speaker 2  .....  92c
59  Table of "soundness or rottenness" by Speaker 1  .....  92c
60  Table of "soundness or rottenness" by Speaker 2  .....  92d
61  Table of "soundness or rottenness" by Speaker 5  .....  92d
62  Table of "the human mind" by Speaker 1  .....  92d
63  Table of "the human mind" by Speaker 2  .....  92e
64  Table of "the human mind" by Speaker 5  .....  92e
65  Table of "Mrs. Slipslop" by Speaker 1  .....  92e
66  Table of "Mrs. Slipslop" by Speaker 2  .....  92e
67  Table of "Mrs. Slipslop" by Speaker 5  .....  92f
69  Table of "which" by Speaker 1  .....  92f
70  Table of "which" by Speaker 2  .....  92f
71  Table of "which" by Speaker 5  .....  92f
72  Information on the occurrence of rules  .....  98

PREFACE


In order to provide a theoretical basis for a general purpose
recognizer, we have investigated the possibility of organizing the
information available on the various aspects of speech into a
multidimensional model, based upon genetive linguistic, phonetic, and
acoustic considerations. We have thus attempted to establish an
orderly method for representing speech sounds. This orderly method
is unique, we believe, for while it can readily be used to describe
the sounds that occur in carefully and discretely articulated speech,
it can also provide a basis for a recognition program for imperfectly
articulated continuous speech. For example, the recognition of
"bet you" (bechyou) in continuous speech utilizes information which is
similar to that previously known: the representation of "be" and the
representation of "chew". It is our rules of euphonic combination which
bridge the gap between what was previously known about discrete speech
("be" and "chew") and what we have discovered about continuous speech
(bechyou), by indicating that such a modification of discrete speech is
likely to occur in continuous speech.
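
The "bet you" example can be sketched as a rewrite rule acting on a
sequence of phone symbols. The sketch below is only an illustration under
stated assumptions: the phone symbols, the "#" word-boundary marker, the
RULES table, and the function name combine are hypothetical and are not
the notation developed later in the report.

```python
# Hypothetical notation: phones are short strings, "#" marks a word boundary.
# The single rule below is an assumed encoding of the "bet you" example.
RULES = {
    ("t", "#", "y"): ["ch"],  # final t + initial y merge into ch
}


def combine(phones):
    """Apply the boundary rules in one left-to-right pass."""
    out = []
    i = 0
    while i < len(phones):
        window = tuple(phones[i:i + 3])
        if window in RULES:
            out.extend(RULES[window])  # replace the three-symbol window
            i += 3
        else:
            out.append(phones[i])      # copy the phone unchanged
            i += 1
    return out


discrete = ["b", "e", "t", "#", "y", "ou"]  # "bet you", carefully articulated
print(combine(discrete))                    # ['b', 'e', 'ch', 'ou']
```

The output corresponds to the stored representations of "be" followed by
"chew", which is the sense in which a rule bridges discrete and continuous
speech.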

Instead of undertaking the formidable task of examining vast
samplings of continuous speech, we have constructed our model on the
basis of existing literature. The physical basis of articulation has been
and is currently being investigated thoroughly by other researchers.

For the most part, our explanations of the physical production of
sounds concur with widely-accepted descriptions: our one exception
(and thus our major contribution) to this description is our emphasis
on the distinctive nature of continuous speech. Existing theories suggest
that normal speech can be reduced to its scientific essentials by studying
the production of individual sounds, and then combining sounds in an
additive fashion. That is, by forcing air through the articulators,
sound is produced; changing the position of the articulators changes
the acoustic properties of the sound. Thus for each arrangement of the
articulators by a particular person, there corresponds a sound of
reasonably distinct acoustic properties.

We contend, however, that one cannot merely use the sum of a
sequence of separately-produced sounds to describe what happens in
normal or continuous speech. For speech does not consist merely of
placing the articulators in a position which is fixed for a particular
sound, and forcing air through them, one breath for each sound. Instead,
the air is forced through continuously, and the articulators are constantly
moving from one position to another, making a continuous flow of
sounds. It is then important to recognize that in continuous speech,
sounds can easily modify surrounding sounds, so that the waveform of
a sound produced in continuous speech can differ significantly from the
waveform of that sound pronounced in isolation. In continuous speech,
sounds can be eliminated, added, added together or substituted for
one another.

Thus the matter of articulation, when applied to continuous
speech, is intimately connected with the phenomenon of sound change.
It is on this basis that we undertook an historical survey of sound
change in various Indo-European languages, which utilize the same
physical modes of production. This study is presented in Sections 1-3
of this report. From this body of linguistic research, we collected
several hundred tentative "rules" of sound change, assuming, by
analogy with the Ergodic theory of physics, that all the sound changes
which have occurred in the historical development of languages are
being duplicated today, at a particular moment in a given language.
We do not, however, accept such "rules" as final, until their occurrence
in modern-day English has been substantiated by examining samples of
continuous speech.

Phonetic analysis is a tool of the linguist, and can be used only
to ascertain how continuous speech is perceived by the human ear
(i.e. whether or not words actually do or do not contain the sounds
indicated in their orthography). As such, however, it provides a
worthwhile indicator of the acoustic discrepancies which may occur in
continuous speech.

The final analysis and criterion must be acoustic, however, for
it is the acoustic waveform which must be understood by a speech
recognizer. For this reason, Section 4 of this report, which presents
acoustic evidence to justify our treatment of speech information, might
be considered an essential contribution to this study.

Our work in this study has been limited to examining the
characteristics of those sounds usually classified as consonants. (We do,
however, state several general observations and recommendations about
vowel treatment, although the vowels were not studied in depth.) In
seeking to provide an orderly means of representing consonants, we have
reached several major conclusions: (1) The character of a given
consonant - its place or manner of articulation, and thus its acoustic
representation - changes according to the sound which precedes or follows
it. (2) For this reason, so-called "consonant clusters" should be treated
as unique entities, not as the addition of two or more fixed sounds.
(This is amplified in our discussion on segmentation in Section 5.) (3) If
consonant clusters are treated as special consonants, then speech can be
divided into segments consisting of "consonant-vowel" combinations,
including the onglide and offglide transitions to make recognition
more precise.
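
Conclusions (2) and (3) can be illustrated by a small segmentation
routine. This is a sketch only: the vowel inventory, the phone symbols,
and the function name segment_cv are assumptions made for the example,
and the onglide and offglide transitions are not modelled.

```python
# Assumed toy vowel inventory; real phone classes would come from the model.
VOWELS = {"a", "e", "i", "o", "u"}


def segment_cv(phones):
    """Split phones into (consonant_cluster, vowel) segments.

    Adjacent consonants are kept together as one cluster, so a cluster is
    handled as a single unit rather than as a sum of separate sounds.
    """
    segments = []
    cluster = []
    for phone in phones:
        if phone in VOWELS:
            segments.append((tuple(cluster), phone))
            cluster = []
        else:
            cluster.append(phone)
    if cluster:
        # Trailing consonants with no following vowel form a final segment.
        segments.append((tuple(cluster), ""))
    return segments


print(segment_cv(["s", "m", "a", "l"]))
# [(('s', 'm'), 'a'), (('l',), '')]
```

Here the cluster "sm" of "small" is carried as one consonant unit attached
to its following vowel.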

The multidimensional model for speech recognition is thus an
ordered manner of representing the various classes of consonants in
such a way that a shift or drift of consonants to another class can be
accounted for. The body of rules of sound change or Euphonic
Combination, as we shall show, can be represented in symbolic form
suitable for computer programming. Our model plus the rules of
euphonic combination thus represents an error-correcting code for
speech recognition. For since the same degrees of freedom exist -
within the realm of physical possibility and necessity - we can predict
the mistakes which may occur. We are thus operating by analogy with
the Ergodic theory of physics, rather than following the hypothetical
constructs of linguistics, which are at times contradictory and often
unorganized.
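
One way to read the phrase "error-correcting code" is that the rules are
run in reverse: each observed phone is expanded into the underlying
sequences that could have produced it, and a word list selects the
expansions that make sense. The sketch below assumes a toy inverse-rule
table and a two-word lexicon; INVERSE_RULES, LEXICON, candidates and
decode are hypothetical names, not part of the report.

```python
# Assumed inverse rules: observed phone -> possible underlying sequences.
# "#" marks a restored word boundary.
INVERSE_RULES = {
    "ch": [["ch"], ["t", "#", "y"]],
}

# Toy lexicon keyed by phone tuples (illustrative only).
LEXICON = {("b", "e", "t"): "bet", ("y", "ou"): "you"}


def candidates(observed):
    """Expand every observed phone into each underlying sequence allowed."""
    results = [[]]
    for phone in observed:
        options = INVERSE_RULES.get(phone, [[phone]])
        results = [prefix + option for prefix in results for option in options]
    return results


def decode(observed):
    """Return the word sequences in which every word is in the lexicon."""
    decoded = []
    for cand in candidates(observed):
        words, current = [], []
        for phone in cand + ["#"]:      # sentinel boundary closes last word
            if phone == "#":
                words.append(tuple(current))
                current = []
            else:
                current.append(phone)
        if all(word in LEXICON for word in words):
            decoded.append([LEXICON[word] for word in words])
    return decoded


print(decode(["b", "e", "ch", "ou"]))  # [['bet', 'you']]
```

The reverse rule restores both the omitted word boundary and the
orthographic phones t and y that were altered in continuous speech.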

Furthermore, our work has suggested segments which are
better suited for recognition by the perceiver than either phonemes or
words. And finally, our work has indicated the importance of
including such aspects of speech as intensity and duration as
considerations necessary for the segmentation of speech. Additional
research in these areas seems advisable.

It may be noticed that certain of the concepts presented in
this report will be familiar to the reader. We include such information
for several reasons:

1) to state a common background and to provide information for
those readers not specializing in any one of these aspects.

2) to provide detailed descriptions of our assumptions and thus
to indicate the extent of applicability of our methods and our
results.

3) to order information which has been previously available
from various sources, but which has never been presented
in an organized form in the published literature.

4) to describe and to explain our method of ordering sounds, and
to justify our positioning of sounds in the multidimensional
model.

5) to ascertain that the just critics can find constructive aspects
and that professional critics can be credited with justifiable
comments.


The  individual  treatment  of  the  work  done  by  western  linguists, 
of  sound  change,  of  the  concept  of  the  multidimensional  model, 
and  of  the  acoustic  evidence  substantiating  this  concept  has  necessitated 
a  certain  amount  of  repetition;  such  repetition  is  necessary  for  the  sake 
of  clarity.  In  order  to  organize  and  consider  all  the  necessary  aspects 
of  speech,  we  have  been  denied  a  study  in  depth  of  several  areas  where 
such  study  seems  advisable.  Our  thoroughness  has  been  to  include  all 
aspects,  rather  than  to  examine  certain  of  these  aspects  in  great  detail. 
Furthermore, a historic review was necessary for several reasons:

(1) To point out difficulties of definition which have confused
previous research in this field. One serious example of
unclear definition is the historic use of the term "phoneme."

(2)  To  place  our  work  in  perspective  with  contemporaries,  and 
to  clarify  the  stand  taken  in  other  published  work. 

(3)  To  evaluate  which  concepts  in  the  published  literature  were 
irrelevant,  and  to  determine  which  of  these  concepts  could  be 
adapted  to  suit  our  present  needs. 

Finally, qualifications of certain other aspects of our research
must be touched upon.

While place and manner of articulation are used in defining
phones, the acoustic waveforms are not said to depend on these aspects
alone; this is recognized by our CV ordering of related aspects. The
principal reason for their consistent use is the need for ordering
information about phone combination which is available in linguistic
literature where such nomenclature originated.

Our study introduces the importance of prosodic features of
speech, such as duration and intensity, which have been too often
ignored in work on automatic speech recognition, and which are not
ever considered distinctive features when they actually do provide new
differentia, as in "balm" and "bomb".

The role of pitch, intensity, time normalization, etc., is left
in a theoretical state primarily because of the need for depth studies
in each of these areas for additional discussion. Moreover, the
available information defines the situation only to the degree of
justifying their inclusion as dimensions in the model.

It is worth noting, finally, that the results of a recent study,
"Dimensions of Perception of Consonants," were published by Robert
W. Peters in the December, 1963 Journal of the Acoustical Society of
America (35:12, pp. 1985-9). In analyzing the psychological "distance"
between consonants, it was reported that certain dimensions could
be ordered according to their importance in identifying consonants:
manner of articulation was judged the most important dimension, followed
by voicing, and then place of articulation, in that order.
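
The reported ordering of dimensions can be illustrated by a weighted
distance between consonants. The numerical weights below are assumptions
chosen only to respect the ordering manner > voicing > place; they are not
values from the Peters study or from this report, and the names Consonant,
WEIGHTS, PHONES and distance are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Consonant:
    manner: str
    voiced: bool
    place: str


# Assumed weights: only their ordering (manner > voicing > place) matters.
WEIGHTS = {"manner": 3.0, "voiced": 2.0, "place": 1.0}

# Conventional articulatory descriptions of four English consonants.
PHONES = {
    "p": Consonant("stop", False, "bilabial"),
    "b": Consonant("stop", True, "bilabial"),
    "t": Consonant("stop", False, "alveolar"),
    "s": Consonant("fricative", False, "alveolar"),
}


def distance(a, b):
    """Sum the weights of every dimension on which two consonants differ."""
    x, y = PHONES[a], PHONES[b]
    return sum(
        weight
        for dimension, weight in WEIGHTS.items()
        if getattr(x, dimension) != getattr(y, dimension)
    )


print(distance("p", "t"))  # 1.0  (place only)
print(distance("p", "b"))  # 2.0  (voicing only)
print(distance("t", "s"))  # 3.0  (manner only)
```

Under these assumed weights a difference in place alone separates two
consonants least, and a difference in manner separates them most.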

We agree that these dimensions are important and necessary to
the identification of consonants; indeed, these dimensions are included
as the major criteria for categorization in each plane of our
multidimensional model.

Section 1 of this report reviews the linguistic problems involved
in a study such as this; Section 2 outlines the form of the Multidimensional
Model; Section 3 considers sound changes derived from our linguistic
phonetic and genitive research in terms of the model. It is in Section 3
(and Appendix H) that we propose a body of rules for euphonic
combination in continuous speech.

Section 4 presents the acoustic data - the measurements made on
portions of the speech waveform in order to test the validity of our approach.

Section 5 suggests a method of segmenting continuous speech into
units which, when used in combination with the stored rules for euphonic
combination, are suitable for computer processing. In that section
we also perform a cursory examination of the frequency of occurrence
for various rules of euphonic combination, and explore various techniques
of computer formatting which could be used to match continuous
speech to orthographic script.




MULTIDIMENSIONAL  MODEL  FOR  AUTOMATIC  SPEECH  RECOGNITION 

BY 


B. V. BHIMANI

SECTION 1

LINGUISTIC  ASPECTS  OF  SPEECH 
GENERAL  PROBLEMS  OF  AUTOMATIC  SPEECH  RECOGNITION 

The basic purpose of this study is to define an orderly set of
relationships that exist between speech patterns as they are physically produced
and the sound of speech as it is perceived by human and mechanical means.

It is our thesis that the sounds of human speech are not random phenomena,
arbitrarily measured. Rather, there is a direct connection between the
way speech is produced within the physical limits of choice available, and
the sounds that the speaker produces. As a theoretical and practical aid
in analyzing the related data of phonemic combination and acoustical perception
we introduce the concept of a multi-dimensional model organized according
to the physical freedoms a speaker may exercise in articulating his sounds
and their interaction with each other.

Our study, it should be emphasized, represents both a synthesis of
past works in phonetics, linguistics, and acoustics and the first general
outline of what has been studied, what is relevant, and what is needed in
the broad field of speech recognition. The work of contemporaries in
investigating phenomena of acoustics of speech is considered as it relates to our
concepts.

In combining such data we do not assert that speech can be defined
and measured according to rigid rules of phonetic combination. Rather,
our problem is the relative freedom an individual speaker has to vary
the sound of his words and still make them recognizable to the human ear.
A model for speech recognition must have categories comprehensive enough
to include these variations. It is particularly for this reason that we must
study the physical causes of such variations, and measure sounds by units
which will allow the greatest possible freedom in identifying related and
unrelated phones.

As an aid to understanding this approach we include first a review
of the history of phonetics and phonemics. A general background for our
concepts is provided also by the charts illustrating the interaction of
several of our dimensions; the general discussion of the divisions which our
model makes and the reasons for making them; and the specific analyses
of the problems involved in classifying and measuring each of the dimensions --
manner of articulation, place of articulation, resonance, vowels, duration,
intensity, and frequency.



In this discussion we have constantly related our formulation
of how speech is produced to the ways of measuring speech perception,
ranging from the human ear to spectral analyses and time-amplitude
waveforms. Thus adequately outlined, our formulation provides
the basis for more detailed analysis of the problems involved in both
its theoretical concepts and in its possible application in a practical
field such as the mechanical development of a general purpose speech
transcriber.

One of the basic problems in building an automatic speech
recognizer is to decide what units the machine should recognize. We
can choose between larger units, such as words, and smaller units, such
as sounds. When we consider the fact that the English language has
several hundred thousand words, it seems impractical to build a
word-recognizer. Once we have made the decision to build a sound-recognizer,
we must decide how it will recognize sounds. It has been suggested that
an automatic speech recognizer should be a phoneme recognizer. We do
not agree with this suggestion, but before we can give our reasons, we
must first discuss the meaning of "phoneme."

The word "phoneme" is currently used with at least two different
meanings. Sometimes it is a synonym for "speech sound" and sometimes
it refers to a class of speech sounds. It is primarily linguists who use
the word in the latter sense, and since they do not always explain the term,
it is frequently difficult for those who have not read widely in this field
to follow the finer points of their discussions.

In this section we will describe the linguist's use of the word "phoneme"
and the concepts which underlie it. The discussion will begin with a brief
history of how the concepts evolved and a description of some currently-
held theories about phonemes. We will next describe the techniques of
phonemic analysis. Finally we will discuss the relevance of the phoneme
to automatic speech recognition and explain why we think some other sound
unit should be used for the machine.

I. HISTORY AND DESCRIPTION OF PHONEMIC THEORY

Linguistics as an academic discipline began in the early part of
the nineteenth century with the discovery that there were regular sound
correspondences between the Germanic languages, such as English, and other
members of the Indo-European group, such as Sanskrit, Greek, and Latin.
These correspondences were of the type: Latin p corresponds to English f,
Latin t corresponds to English th, Latin c (pronounced k) corresponds to
English h. Some words illustrating these correspondences are:


Latin pater     English father
Latin tu        English thou
Latin tres      English three
Latin centum    English hundred

The linguists who discovered these correspondences explained them
with the theory that most of the languages of Europe and many of the
languages of Asia are descended from one single language which was spoken
at some time in prehistory. This language was given the name Proto-
Indo-European. Since it was spoken in prehistory, all our knowledge of it
comes from comparing the languages descended from it.

In comparing these languages to decide whether the Proto-Indo-
European word for three began with t or th, the linguists concluded that
the t is original and the th an innovation, because only the Germanic
languages have th. (At present, only two of the Germanic languages have th,
but we know from written records that the others had it earlier.)

The discovery that these sound changes had taken place was
important not only for an understanding of language relationships, but also for
an understanding of phonetics. p, t, and k are phonetically similar; they are
all voiceless stops. f, th, and h are also phonetically similar; they are all
voiceless continuants. This means that the process whereby p became f was
identical with the process whereby t became th and k became h. This suggests
that the sounds of a language operate as a system, rather than independently
of each other.
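The systematic character of this change can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the feature table, the names FEATURES, SYMBOL, and shift, and the simplified labelling of h as a "velar continuant" are ours and are not part of the model described in this report. The point shown is that a single rule on one feature (stop becomes continuant) accounts for all three correspondences at once.

```python
# Hypothetical feature table: sound -> (place, manner, voicing).
# The labels are simplified for illustration only.
FEATURES = {
    "p": ("labial", "stop", "voiceless"),
    "t": ("dental", "stop", "voiceless"),
    "k": ("velar", "stop", "voiceless"),
    "f": ("labial", "continuant", "voiceless"),
    "th": ("dental", "continuant", "voiceless"),
    "h": ("velar", "continuant", "voiceless"),
}

# Reverse lookup: feature bundle -> sound symbol.
SYMBOL = {bundle: sound for sound, bundle in FEATURES.items()}


def shift(sound):
    """Apply one rule to one feature: voiceless stop -> voiceless continuant."""
    place, manner, voicing = FEATURES[sound]
    if manner == "stop" and voicing == "voiceless":
        manner = "continuant"
    return SYMBOL[(place, manner, voicing)]


if __name__ == "__main__":
    for sound in ("p", "t", "k"):
        print(sound, "->", shift(sound))
```

Because the rule refers only to manner and voicing, the place of articulation is carried through unchanged, which is the sense in which the three changes are one process.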


There are two other important sets of sound-changes between Proto-
Indo-European and early Germanic. They are illustrated by the following
words:

Latin duo       English two
Latin id        English it
Sanskrit dha    English do

These correspondences are summarized by the statements that the Proto-
Indo-European voiced unaspirated stops b, d, and g became the Germanic
voiceless stops p, t, and k and that the Proto-Indo-European voiced
aspirated stops bh, dh, and gh became the Germanic voiced stops b, d, and
g or aspirants. (The exact nature of these sounds is not
clear because the Germanic languages have made various sound changes since
then. It is true, however, that Sanskrit dh usually corresponds to modern
English d.) Again, similar sounds underwent similar changes.




The  discovery  of  these  sound-correspondences  gave  the  impetus  for 
further  research  into  the  sound-correspondences  among  the  Indo-European 
languages.  All  of  the  languages  descended  from  Proto-Indo-European  had 
made  some  sound  changes,  and  the  scholars'  problem  was  to  reconstruct 
the  original  language. 

In  comparing  words  to  discover  sound-correspondences,  the  scholars 
never  knowingly  compared  words  which  one  language  had  borrowed  from 
another.  Whenever  such  words  were  included,  the  results  disagreed  with 
the  correspondences  discovered  by  other  comparisons.  There  are  many 
words  besides  pater  and  father  which  show  that  Latin  p  corresponds  to 
English f, but the English word paternal appears to show that Latin p
corresponds  to  English  p.  The  explanation  is  that  paternal  is  borrowed  from 
Latin,  and  therefore  should  not  be  used  to  discover  the  sound  correspondence 
between  English  and  Latin.  The  task  of  establishing  the  correspondences 
was  complicated  by  the  presence  of  loan-words  which  were  not  recognized  as 
such, and by the fact that some languages had undergone many sound changes,
and some of the later changes obscured the effects of the earlier ones.

There were three types of sound-change which these early nineteenth
century linguists recognized: conditioned, unconditioned, and sporadic.
A sound change which had taken place only under certain circumstances was
a conditioned change. Proto-Indo-European t became English th everywhere
except where another spirant or sibilant preceded it. Thus the t in Latin
tu corresponds to the th in English thou, but the t in Latin sto corresponds
to the t in English stand. A sound change which takes place under any
circumstances is an unconditioned sound change. Proto-Indo-European
o became a in the Germanic languages in all positions. Latin toga goes
back to the same Proto-Indo-European word as English thatch; the word
probably had the original meaning of a covering. The third type of sound
change which the early nineteenth century linguists considered important
was sporadic sound change. As the name implies, this type of sound change
was described as not subject to any rules.
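The difference between the first two types can be made concrete with a short Python sketch. It is an illustration only: the segment lists are simplified stand-ins rather than reconstructed forms, and the names BLOCKING_SOUNDS, unconditioned_o_to_a, and conditioned_t_to_th are hypothetical. An unconditioned change looks at the sound alone; a conditioned change also looks at its environment.

```python
# Sounds assumed (for illustration) to block the t -> th change when they
# immediately precede t: spirants and sibilants.
BLOCKING_SOUNDS = {"s", "f", "h", "th"}


def unconditioned_o_to_a(segments):
    """o becomes a in every position, whatever the neighbouring sounds."""
    return ["a" if seg == "o" else seg for seg in segments]


def conditioned_t_to_th(segments):
    """t becomes th except where a spirant or sibilant precedes it."""
    result = []
    for index, seg in enumerate(segments):
        previous = segments[index - 1] if index > 0 else None
        if seg == "t" and previous not in BLOCKING_SOUNDS:
            result.append("th")
        else:
            result.append(seg)
    return result


if __name__ == "__main__":
    print(conditioned_t_to_th(["t", "u"]))        # t is word-initial
    print(conditioned_t_to_th(["s", "t", "o"]))   # t follows a sibilant
    print(unconditioned_o_to_a(["t", "o", "g", "a"]))
```

A sporadic change, as described above, is precisely one for which no such function could be written.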

Later the idea of sporadic sound change was attacked by a group
of scholars who maintained that all sound change was regular, that is to
say, all sound changes could be divided into two groups, those which were
unconditioned and those for which the conditions could be clearly stated, if
all necessary data were available. This theory was an innovation because
up to that time the most widespread view about sound change was that each
word had its own history; the new theory suggested that each sound had its
own history. This theory not only had immediate applications to the problems
of reconstructing Proto-Indo-European but it also had a long-range effect
on all later theories about how sounds function in language.

At about the same time that the theory of regular sound change was
finding acceptance, linguists also began to take interest in the fact that the
sounds  used  in  human  speech  showed  much  more  variety  than  anyone 
had  given  them  credit  for.  The  impetus  for  this  came  from  the  dialect 
geographers,  who  were  making  maps  of  regions  or  even  whole  countries 
which  showed  just  how  the  speech  of  the  people  in  each  village  differed 
from  that  of  the  people  in  the  adjacent  village.  Once  they  became  aware 
of  the  great  phonetic  variety,  the  linguists  began  making  phonetic  trans¬ 
criptions  which  faithfully  recorded  every  variation  they  could  hear.  This 
approach  became  the  standard  practice,  and  since  it  is  true  that  no  two 
speech  events  by  the  same  speaker  are  absolutely  identical  and  many  of  the 
differences  are  great  enough  to  be  audible,  it  became  common  to  indicate 
in  the  transcription  that  a  given  individual's  utterance  of  the  word  "cat" 
on  Monday  differed  slightly  from  his  utterance  of  the  same  word  on 
Wednesday. 

There was another reason for variations in phonetic transcriptions,
although most of the phoneticians of that time did not realize it. No human
being, however carefully trained, can function as an automatic transcribing
machine. The listener shows variations just as the speaker does. On one
day a sound might seem to be the vowel of bet somewhat lowered, and on
the next day it might seem to be the vowel of bat somewhat raised. The
situation is not improved by training the linguist to hear seven different
steps along the continuum from high vowel to low. Instead of transcribing
a sound as a low vowel of bet one day and a high vowel of bat another day,
the linguist transcribes it as a vowel of the fourth highest step on one day
and the fifth highest on another. It is true that some linguists are almost
perfectly consistent in their transcriptions, but this consistency appears to
be an inborn gift rather than something which can be taught.

The extremely detailed transcriptions produced by attempting
phonetic accuracy were not easy to work with. While it was true that
the word "cat" was never pronounced exactly the same way twice, it was
also true that it was always recognizable as "cat" and not confused with
"hat" or "gat" or "at." It became clear that some phonetic differences
played an important role in a given language while others were irrelevant.
With this realization came the need for terminology which would make it
easy to talk about the distinction between relevant and irrelevant differences.
It was at this point that the word "phoneme" came into common use, to refer
to a group of sounds, differences among which were irrelevant.

Although it has been widely used for about forty years, there is
still no agreement on precisely how to define the term "phoneme." This
is not to say that no definitions have been proposed; there have been many
definitions and much discussion, but they have not led to unanimity.


The different definitions of the phoneme which have been proposed
fall into four main groups. Those of the first say that the phoneme
is a psychological entity; it is a physical entity to those of the second;
and it is both a physical and a psychological entity to those of the third;
whereas those of the fourth say that the phoneme is a class of sounds, but
they do not ascribe physical or psychological reality to this class. The
linguists  who  subscribe  to  definitions  of  the  fourth  type  do  not  flatly 
state  that  the  phoneme  has  no  physical  or  psychological  reality;  they 
say  that  there  is  not  sufficient  proof  on  this  point,  and  until  proof  is 
forthcoming,  it  is  better  not  to  make  unnecessary  assumptions. 

It  should  be  noted  that  although  there  is  disagreement  on  how  to 
define  the  phoneme,  it  is  possible  and  even  common  for  linguists  who 
define  it  differently  to  arrive  at  the  same  phonemic  analysis  of  a  given 
set of data. This is because the techniques of phonemic analysis are
quite  similar,  no  matter  what  it  is  that  the  linguist  thinks  he  is  analyzing. 

This is not to say that all phonemic analyses of the same body of
data will be the same; they will not be. However, the differences cannot
be wholly attributed to differences in phonemic theory. Yuen-Ren Chao
gives the following list of criteria which are used for phonemic analysis
(Chao, 1934).

(1) phonetic accuracy, or smallness of range of phonemes

(2) simplicity or symmetry of phonetic pattern for the
whole language

(3) parsimony in the total number of phonemes

(4) regard for the feeling of the native speaker

(5) regard for etymology

(6) mutual exclusiveness between phonemes

(7) symbolic reversibility (that is to say, given any phonemic
symbol in a language, the range of sounds it represents
is determined; given any sound in the language, its
phonemic symbol is determined).

Chao points out that different linguists don't attach the same weight to
these criteria. Some of the criteria, such as (4), are especially important
to those who define the phoneme as a psychological entity; some, such as
(1), are especially important to those who define the phoneme as a physical
entity. There are others, however, such as (6), which are not weighted by
one or another of the four definitions listed. This means that differences
in analysis of the same data arise from different definitions of the
phoneme and from different criteria of phonemic analysis.
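Of the criteria in Chao's list, symbolic reversibility (7) is the one most easily stated as a mechanical check. The following Python sketch is our own reading of that criterion, not Chao's formulation: it assumes an analysis is given as a table from phonemic symbols to the sets of sounds they cover, and it tests whether every sound leads back to exactly one symbol. The function name and the toy inventories are hypothetical.

```python
def is_symbolically_reversible(phoneme_to_phones):
    """Return True if every sound belongs to exactly one phonemic symbol.

    phoneme_to_phones maps each phonemic symbol to the set of sounds it
    represents. The forward direction (symbol -> range of sounds) is fixed
    by the table itself; this function checks the reverse direction.
    """
    owner = {}
    for phoneme, phones in phoneme_to_phones.items():
        for phone in phones:
            if phone in owner and owner[phone] != phoneme:
                return False  # the same sound would need two symbols
            owner[phone] = phoneme
    return True


if __name__ == "__main__":
    # Toy inventories with made-up sound labels, for illustration only.
    reversible = {"/k/": {"front k", "back k"}, "/t/": {"t"}}
    overlapping = {"/k/": {"front k", "back k"}, "/c/": {"front k"}}
    print(is_symbolically_reversible(reversible))
    print(is_symbolically_reversible(overlapping))
```

An analysis that fails this check may still be preferred by a linguist who weights other criteria, such as (3) or (5), more heavily.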

II. TECHNIQUES OF PHONEMIC ANALYSIS

Despite  the  disagreement  about  defining  the  phoneme  and  about 
the  criteria  for  phonemic  analysis,  most  linguists  use  similar  procedures 
when  making  a  phonemic  analysis,  and  they  come  up  with  similar  results. 
The  first  step  in  phonemic  analysis  is  to  make  a  phonetic  transcription, 
but there is a problem connected with this.

As the linguist listens carefully to a native speaker, there seems
to be an enormous number of different sounds. Since he knows that it
is highly unlikely that two speech sounds will be physically identical,
he is not surprised at the number of differences he hears, but it poses
a serious problem in transcribing and analyzing the utterances. The
customary solution is for the linguist to note only those differences which
he can analyze. This means that he uses the same symbol to transcribe
two sounds which he knows are different. Since they are different, they
cannot be called the same sound, but it is convenient to have some term
to refer to all sounds which are transcribed with the same symbol. We
shall use the term phone class. It should be noted that the phone classes
of one linguist will not necessarily coincide with those of another. The
number and extent of the phone classes of any linguist depend on his
training, on his inborn ability to analyze sound differences, and on the
differences that are relevant in his native language.

If a linguist transcribes very few phone classes because he can
analyze few differences, he may have difficulty making a phonemic
analysis. If he places two sounds with relevant (phonemic) differences
in the same phone class, he is in trouble. It is essential for phonemic
analysis that all sounds which are phonemically different shall have
different phonetic transcriptions. It is for this reason that the linguist
notes all the differences he can analyze; he does not know which are
phonemic and which are not.

There are two types of non-phonemic variation. One type results
from the fact that two speech sounds are almost never physically identical.
The variations of this type are minor. The second type results from the
fact that it is common in language for a phoneme to have quite different
phonetic realizations in different environments; these are called
allophones. In English the k of key is different from the k of coo, although
they both belong to the /k/ phoneme. The k of key has its place of
articulation at the front part of the /k/ range, because the following
vowel is a front one. The k of coo has its place of articulation at the
back part of the /k/ range, because the following vowel is a back one.

These two [k]'s belong to the same phoneme in English because they
are phonetically similar and do not contrast. There are no minimal
pairs identical except that one word has the [k] of key where the other
has the [k] of coo. In Rumanian, however, these two [k]'s belong to
different phonemes, and there is at least one minimal pair, identical
in every respect except that one word has the front [k] where the other
has the back [k]. The words are cu 'with' and chiu 'cry'. The letter
c is used to spell the front [k].
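The search for minimal pairs described here is mechanical enough to sketch in Python. The sketch below is an illustration under stated assumptions: words are given as tuples of phone-class labels, the labels "k+" (front k) and "k-" (back k) are our own notation, and the two toy lexicons are hypothetical stand-ins rather than real transcriptions of English or Rumanian.

```python
def minimal_pairs(lexicon, phone_a, phone_b):
    """Find pairs of words identical except for phone_a versus phone_b.

    lexicon is a collection of words, each a tuple of phone-class labels.
    Two words form a minimal pair if they have the same length and differ
    in exactly one position, where one has phone_a and the other phone_b.
    """
    pairs = []
    words = sorted(lexicon)
    for index, first in enumerate(words):
        for second in words[index + 1:]:
            if len(first) != len(second):
                continue
            diffs = [(x, y) for x, y in zip(first, second) if x != y]
            if len(diffs) == 1 and set(diffs[0]) == {phone_a, phone_b}:
                pairs.append((first, second))
    return pairs


if __name__ == "__main__":
    # Hypothetical lexicons: in the first, front k occurs only before a
    # front vowel and back k only before a back vowel, so no pair is found.
    english_like = {("k+", "i"), ("k-", "u")}
    rumanian_like = {("k+", "u"), ("k-", "u")}
    print(minimal_pairs(english_like, "k+", "k-"))
    print(minimal_pairs(rumanian_like, "k+", "k-"))
```

An empty result does not by itself prove that two sounds belong to one phoneme, since the lexicon may simply be too small; it only shows that no contrast has been found in the data at hand.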

In Appendix A we analyze a small amount of phonetic data; this
is to show how a linguist decides which differences are phonemic and
which are non-phonemic. As Appendix A shows, the Vietnamese [k]
and [kp] alternate according to the adjacent vowel, just as the two [k]'s
do in English. In another language which has these two sounds, they may
belong to different phonemes. We cannot predict these matters from one
language to another. The phoneme is a structural unit of language, and
the phonetic shape of each phoneme must be determined for each language.

In  this  section  we  have  described  how  sounds  function  in  relation 
to  language;  we  shall  next  consider  briefly  the  question  of  how  an  auto¬ 
matic  speech  recognizer  should  function  in  relation  to  speech  sounds. 

Any  design  for  an  automatic  speech  recognizer  must,  first,  be  feasible  from 
the  engineering  point  of  view,  and,  second,  must  distinguish  between  all 
utterances  which  are  different  in  the  language  being  analyzed.  It  should 
be  noted  that  the  second  condition  states  only  that  a  speech  recognizer 
should  distinguish  between  all  different  utterances;  we  do  not  stipulate 
that it must ignore non-phonemic differences. We doubt that it is practical
to  build  a  phoneme  recognizer. 

Phonemic analyses are made by human beings with all the
resourcefulness of the human brain. No one knows how the brain
works, but we do know that it is the most efficient self-adjusting mechanism
in the world. Speech takes advantage of this fact by requiring listeners
to ignore some details and concentrate on others. The problem in
automatic speech recognition is that we do not know how to build a machine
that is equally self-adjusting. A human being, hearing a specific sound
feature, can decide whether it is important and should be noted or it is
irrelevant and should be ignored. In other words, he can decide, as
each sound segment occurs, which aspects he should pay attention to
and which he should ignore. He may pay attention to a feature in one
context and ignore it in another. A machine could do this only if it were
given a precise description of the circumstances under which the feature
should be noted or ignored. If we give a machine such a description
(assuming that such a description is possible), we render it incapable
of making any adjustment whatever for the assimilations which occur
when one sound follows another. If we do not tell it to ignore certain
features under certain circumstances, then the number of different sounds
which  it  recognizes  will  be  greater  than  the  number  of  different  sounds 
which  a  native  speaker  of  the  language  needs  to  recognize.  However,  given 
the  choice  between  a  machine  which  is  not  flexible  and  one  which  makes 
some  distinctions  which  are  non-phonemic  and  perhaps  "unnecessary",  the 
authors  at  present  believe  that  the  machine  which  perceives  non-phonemic 
distinctions  is  to  be  preferred. 

In all languages, there are some phonetic variations which are the
result of one sound influencing a nearby sound. Some of these occur
every time, and thus they are predictable; others occur at some times and
not at other times, and thus they are not predictable. The front [k] of
key is predictable: it always occurs before front vowels. The voiceless
dental stop [t̪] which sometimes replaces the English voiceless dental
spirant [θ] (as in thin) is not predictable. The phrase with care is
pronounced sometimes with [θ] at the end of with and sometimes with [t̪].
Since we cannot give a machine a precise rule about the substitution
of [t̪] for [θ], the solution is to let the machine recognize [t̪] as a separate
entity. When it fails to find [wit̪] in its dictionary, it will check with the
ordered set of sounds (described in the following sections), note that [t̪]
is only one leaf removed from [θ], and decide that the word in question
is with.
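The lookup procedure just described can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the machine design itself: the NEIGHBORS table stands in for the "one leaf removed" relation of the ordered set of sounds described in later sections, the ASCII labels "t_dental" and "th" stand for the dental stop and dental spirant, and the dictionary entries are toy transcriptions. All names are hypothetical.

```python
# Hypothetical adjacency in the ordered set of sounds: which phone classes
# are "one leaf removed" from which others.
NEIGHBORS = {
    "t_dental": ["th"],
    "th": ["t_dental"],
}

# Toy dictionary written in the machine's phonetic script.
DICTIONARY = {
    ("w", "i", "th"): "with",
    ("k", "e", "r"): "care",
}


def recognize(phones):
    """Look a phone sequence up directly, then with one neighbor swapped in."""
    phones = tuple(phones)
    if phones in DICTIONARY:
        return DICTIONARY[phones]
    for index, phone in enumerate(phones):
        for neighbor in NEIGHBORS.get(phone, []):
            candidate = phones[:index] + (neighbor,) + phones[index + 1:]
            if candidate in DICTIONARY:
                return DICTIONARY[candidate]
    return None  # no single substitution leads to a known word


if __name__ == "__main__":
    print(recognize(["w", "i", "th"]))
    print(recognize(["w", "i", "t_dental"]))
```

The sketch tries only one substitution at a time; whether the machine should consider several simultaneous substitutions, or sounds more than one leaf apart, is a design question this illustration leaves open.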


Thus, although in Vietnamese (see Appendix A for details) [k]
and [kp] are considered in at least one analysis to be phonemically the same,
we will not require an automatic speech recognizer to recognize them as
the same sound. We require that the machine distinguish between final [p]
and [kp] because the language contains words which
must be distinguished. But this is our only stipulation about the recognition
of final stop consonants in this language. We need not instruct the machine
to note the p closures of [p] but ignore the p closure of [kp].


We propose, in general, that the machine recognize sound segments
as a linguist recognizes "phone classes" in analyzing an unknown language.
These basic machine acoustic segments will be further described in
Sections 4 and 5; for the present it is sufficient to emphasize that these
segments will be designed to distinguish non-phonemic differences, such
as the two [k]'s mentioned above.


We propose, then, that the machine recognize phone classes as
the linguist does when he is analyzing a new language. These phone classes
will not necessarily coincide with those that any linguist would use, any
more than the phone classes of one linguist necessarily coincide with those
of another. The machine will probably have more phone classes than most
linguists. It will also have a complete dictionary of the language, written
in its phonetic script. Whenever it receives a word which is not in its
dictionary, it will refer back to its rules and to the matrix to determine
what sound-substitution has taken place.

There is an added advantage to having the machine recognize phone
classes. We can switch it from one language to another by giving it another
dictionary and set of rules. The phone classes it recognizes will remain
the same. By designing a phone class recognizer we achieve a flexibility
and reliability which, although they do not match those of the human brain,
are considerably greater than those which could be achieved by a phoneme
recognizer.


SECTION  2:  GENERAL  DISCUSSION  OF  THE  MULTIDIMENSIONAL  MODEL 


INTRODUCTION 

The purpose of the model is an orderly presentation and analysis
of the phenomena of human speech. In effect we are trying to represent
how the different freedoms which every speaker has in shaping the
sounds of his words can be classified to allow for the possibilities of
individual sound variations, then presented in a regular order so that
they may easily be interpreted by human or mechanical means.

In articulating a sound the individual speaker has a number of
freedoms in its production. On the other hand, he is limited to a given
combination of influences he may use at any one time. By identifying the
major sources of physical action which a speaker may use to produce
a sound, then studying their influence upon each other, we come to an
orderly formulation of how speech is produced.

In our model each dimension signifies an independent choice
which a speaker makes in uttering a sound or a phrase. He has the
freedom to choose where he will place his tongue in his mouth when
uttering a sound (place of articulation); how he will move his tongue
to shape the sound (manner of articulation); and whether to use only
his mouth as an echo chamber or whether to add the vocal flaps and the
nasal passages (resonance). In pronouncing vowels, moreover, a speaker
may decide how to move his cheek and jaw muscles, thereby determining
which vowel he is enunciating.

In addition to these physical means of producing speech a speaker
may use subtler variations of emphasis. They include how long to take in
pronouncing a sound (duration); what sounds to stress (intensity); and
where to vary pitch for emphasis (frequency). Such freedoms qualify as
dimensions particularly because they can act as independent agents in
modifying the acoustic waveforms of speech. Moreover, they represent
a special type of modification conveniently studied in a separate category.
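One way to picture the dimensions named so far is as the fields of a single record, one field per independent choice. The Python sketch below is only an illustration under stated assumptions: the class name SoundPoint, the function differing_dimensions, the category labels, and the units chosen for duration, intensity, and frequency are ours and are not specified by the model.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SoundPoint:
    """One sound located on every dimension of the model (hypothetical layout)."""
    place: str        # place of articulation
    manner: str       # manner of articulation
    resonance: str    # mouth only, or with voicing / nasal passages
    vowel: str        # vowel being enunciated, "" if none
    duration: float   # assumed unit: seconds
    intensity: float  # assumed unit: decibels
    frequency: float  # assumed unit: hertz (pitch)


def differing_dimensions(first, second):
    """List the dimensions on which two sounds make different choices."""
    return [
        f.name
        for f in fields(first)
        if getattr(first, f.name) != getattr(second, f.name)
    ]


if __name__ == "__main__":
    front_k = SoundPoint("front velar", "stop", "oral", "", 0.08, 60.0, 0.0)
    back_k = SoundPoint("back velar", "stop", "oral", "", 0.08, 60.0, 0.0)
    print(differing_dimensions(front_k, back_k))
```

Treating each dimension as a separate field reflects the claim that the choices are independent: two sounds can be compared dimension by dimension, and a difference on one says nothing about the others.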

Having provided a framework for analyzing speech patterns, we
face two further tasks, discussed with each dimension. The first is
to account for variations of choice that occur in normal speech -- it is
a truism, for instance, that an individual seldom pronounces a word
identically twice in a row, nor do any two people pronounce the same
word exactly alike. While furnishing classifications adaptable enough to
cover such variations, however, we must always keep in sight our main
objective that such a model be able to be developed for practical use.


To  accomplish  both  ends  the  following  discussion  constantly  relates  the 
work  of  measuring  speech  phonetically  to  the  equally  important  task  of 
processing  speech  patterns  by  mechanical  means. 

I.  DISCUSSION  OF  THE  DIMENSIONS 

Appendix B presents a chart showing the interaction of several
separate dimensions of our model. The horizontal axis shows how a
certain sound might be heard when pronounced at the same place in the
mouth but given different resonances. The vertical axis shows the effect
different positions of the tongue (place of articulation) would produce
within the same resonance. Below the first chart we have indicated on
a separate chart how use of the tongue in a given position can further
modify a sound (manner of articulation).

All the sounds indicated on the front plane are sounds we define
as consonants. (For the sake of clarity the consonants on the front plane
are represented on separate leaves, which correspond to the consonant
classes -- stop, nasal, sibilant, affricate, etc.) These sounds, however,
are further modified by vowels that follow the consonants. An open a,
for example, will modify almost every preceding consonant shown on the
front chart. For this reason we have indicated on the z axis a series of
planes showing the sounds of consonants as they are heard when articulated
with given vowel sounds; k becomes ku, ka, ko; g becomes gu, ga, go,
etc.
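
The stacking of planes along the z axis can be sketched in a few lines
of Python. This is an illustrative sketch only: the names CONSONANTS,
VOWELS, and build_planes are hypothetical, and the consonant and vowel
sets are limited to the examples quoted above.

```python
# Illustrative sketch: one "plane" per following vowel, each holding the
# consonants as they are heard when articulated with that vowel.
CONSONANTS = ["k", "g"]      # examples from the text
VOWELS = ["u", "a", "o"]     # examples from the text


def build_planes(consonants, vowels):
    """Return {vowel: [consonant + vowel, ...]}, one entry per z-axis plane."""
    return {v: [c + v for c in consonants] for v in vowels}


planes = build_planes(CONSONANTS, VOWELS)
for vowel, syllables in planes.items():
    print(vowel, syllables)
```

Reading across the planes for one consonant gives the series in the
text: k becomes ku, ka, ko.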

Only a few dimensions are shown on our charts, it should be
noted, and these primarily as an illustration of how the separate
dimensions can be related to each other for machine identification. Each
of the following dimensions, however, represents an independent
method by which the sounds of a voice may be varied; every such
category must be studied separately and given special treatment in the
construction of our model.

Having illustrated how the interaction of the separate dimensions
can be represented, we may turn to a more specific examination of how
the sounds thus graphed are physically produced and how they can then
be turned into signals that a machine can classify. This we will do in
our following discussion of the dimensions.

A. MANNER OF ARTICULATION, PLACE OF ARTICULATION, AND RESONANCE

Manner of articulation, place of articulation, and resonance are
all physical modes through which men can alter their speech so that it
may produce new information-bearing varieties of sound for the human
ear. Equally important, these modes of change need to be measured
by mechanical means, thereby creating an opportunity for the
transcribing machine to recognize the same sounds as the human ear.

In  this  section,  accordingly,  we  must  give  equal  attention  first 
to  the  many  physical  ways  of  varying  sounds  for  the  human  ear  and  then 
to  the  limited  number  of  ways  a  machine  may  record  them.  For  machines 
there  are  two  principal  ways  that  such  sounds  can  be  represented.  The 
first is spectral analysis of energy concentrations in the speech wave.
The second is time-amplitude plots of the changing shape of the sound
wave that occurs in a given word.
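
The two machine representations can be contrasted on a synthetic
signal. The sketch below is an assumption-laden illustration, not the
project's equipment or method: it uses a pure 1000-cycle tone in place
of speech, an assumed rate of 8000 samples per second, and a naive
discrete Fourier transform written with the Python standard library.
The function names are hypothetical.

```python
import cmath
import math


def time_amplitude(frequency_hz, sample_rate_hz, n_samples):
    """Time-amplitude representation: one amplitude value per sample."""
    return [math.sin(2 * math.pi * frequency_hz * n / sample_rate_hz)
            for n in range(n_samples)]


def magnitude_spectrum(samples):
    """Spectral representation: magnitude of a naive DFT, bins 0..N/2."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(samples))
        spectrum.append(abs(total))
    return spectrum


SAMPLE_RATE_HZ = 8000                             # assumed sampling rate
wave = time_amplitude(1000, SAMPLE_RATE_HZ, 64)   # synthetic 1000-cycle tone
spectrum = magnitude_spectrum(wave)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE_HZ / len(wave)
print(peak_hz)
```

The list wave is what a time-amplitude plot would draw; the list
spectrum shows where the energy is concentrated, here in the single bin
at 1000 cycles.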

In the subsection below sounds are classified according to their
similarity of physical production (as indicated in the X-ray photographs).
They must then be measured according to their physical effect on the
facilities of a transcribing machine (as indicated in the spectrograms
and time-amplitude plots).

The characteristic patterns of spectrograms and time-amplitude
plots are particularly important in helping the machine identify manner
of articulation: thus we include examples of both measurements as well
as illustrative X-ray photographs for most phone classes in subsection A.
Place of articulation, however, can be distinguished primarily by its
formant transitions to and from adjacent vowels. These transitions are
considered in our following treatment of vowels. For this reason we
have included only a few illustrative figures for place of articulation
and resonance.

It should finally be noted that the following three topics are
separate dimensions. They are discussed together for convenience
in relation to the preceding chart showing their specific interaction
in helping our model translate physical combinations of sound into patterns
identifiable by a machine. (See Appendix B.)

1. Manner of Articulation

The distinguishing characteristic of this dimension is how the
tongue and lips are used in producing a sound. The dimension has
six subdivisions. These are presented in the following chart, with
examples of the phones which fall into these subdivisions. For each
example the usual English spelling, the phonetic symbol and a word
containing the sound are given.




Manner of          Usual       Phonetic    Word Containing
Articulation       Spelling    Symbol      the Example

(1) Stops          p, d        p, d        pin, din
(2) Spirants       f, th*      f, ð        fin, the
(3) Nasals         n, ng*      n, ŋ        sin, sing
(4) Sibilants      z, sh*      z, ʃ        zoo, she
(5) Affricates     ch*, j      č, ǰ        chin, join
(6) Laterals       l, l        ɫ, l        dull, let

* It should be noted that although two letters are needed to spell this
  sound in English, it is not two sounds, but one. This is what
  phoneticians call a digraph.

The characteristics of the subdivisions of this dimension are
described below:

(1) Stops: The stops have a complete closure somewhere in the
mouth for a brief period, which results in an almost complete cessation
of sound. This cessation lasts about twenty milliseconds. The velum
at the top of the throat closes during the articulation of a stop, so
that no air escapes through the nose.

Gunnar Fant (Acoustic Theory of Speech Production, p. 186,
Figure 2.8-6) presents X-rays of the side of the mouth while the
stop b is being pronounced. The point of contact is identical with that
of the stop p, but the latter is not voiced.

A time-amplitude plot of the b sound (Figure 1) shows energy
present even during closure. (This can also be detected in Ilse Lehiste
and Gordon Peterson, "Studies of Syllable Nuclei 2", page 54, spectrogram
of bit.) This is one way a machine might distinguish voiced stops from
voiceless stops, because there is no energy present during closure in
the case of voiceless stops.
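
A machine test of this kind can be sketched as an energy measurement
over the closure interval. The following is a minimal sketch under
stated assumptions, not the method used on this project: the closure
boundaries are assumed to be already known, the threshold of 0.05 is an
arbitrary illustrative value, and the signals are synthetic. The names
rms and stop_is_voiced are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def stop_is_voiced(samples, closure_start, closure_end, threshold=0.05):
    """Guess voicing from the energy remaining during the closure interval.

    closure_start and closure_end are sample indices assumed to be known;
    threshold is an arbitrary illustrative value, not a measured one.
    """
    return rms(samples[closure_start:closure_end]) > threshold


# Synthetic data: 160 samples of closure (20 ms at an assumed 8000 samples/s).
voiced_closure = [0.2 * math.sin(2 * math.pi * 120 * n / 8000)
                  for n in range(160)]
voiceless_closure = [0.0] * 160

print(stop_is_voiced(voiced_closure, 0, 160))
print(stop_is_voiced(voiceless_closure, 0, 160))
```

A practical detector would also have to locate the closure itself and
cope with background noise, neither of which is attempted here.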

(2) Spirants: The spirants are articulated by forming a
constriction in the buccal or mouth cavity. In most cases this constriction is
formed by raising some part of the tongue until it almost touches the roof
of the mouth or the upper teeth. In the case of the labiodental spirants,
however, such as f or v, the constriction is formed by placing the upper
teeth very close to the lower lip, as in Gunnar Fant's X-ray tracing of f
(Acoustic Theory of Speech Production, p. 170, Figure 2.6-1).

The velum is closed for the articulation of a spirant. Both the
spectrogram of v (Lehiste and Peterson, "Studies of Syllable Nuclei 2",
p. 10, spectrogram of veal) and the time-amplitude plot of f (Figure 2)
reveal a characteristic presence of random energy which a machine
might identify.

(3) Nasals: Physically, nasals are articulated almost the same
as stops; the only difference is that for a nasal the velum remains open,
so that a stream of air escapes from the nose. Thus the articulation
for m (Fant, p. 140, Figure 2.4-1) is the same as that for b (Fant,
p. 186) except for the velum at the back of the mouth. Acoustically,
nasals differ from stops because there is no cessation of sound and no
sharp burst of air when the closure is released.

These differences might be easily identified by a machine
because nasalization creates an extra low-frequency formant on a
spectrogram in addition to the three formants normally present in
consonants and vowels (Lehiste and Peterson, p. 10, spectrogram of
"coin"). The time-amplitude plot, moreover, shows the relatively
great intensity of low frequencies in nasal sounds, indicated by the
comparative regularity and wide spacing of the peaks and valleys
(Figure 3), although the peak radiated amplitude of nasals is
significantly lower than that of many vowel sounds.

(4) Sibilants: The sibilants are like the spirants in that they
involve a constriction in the mouth, but the shape of the tongue is
different, and the characteristic sound is produced, not at the point
of constriction, but when the air which has rushed through this
constriction hits the upper front teeth (Von Essen, 1963, p. 73).

As an illustration of this difference in pronunciation we include
two palatograms of the spirant th (Figure 4) and the sibilants s or z
(Figure 5). Black areas represent close contact between the tongue and
the roof of the mouth; gray areas represent loose contact. The close
contact and narrow opening of spirants as opposed to the loose contact
and diffused opening of the sibilants is apparent. (For a description of
how palatograms are made see Appendix C.) Sibilants are acoustically
different from spirants in that a sibilant has no energy below a certain
frequency, but very high random energy above that frequency (as in the
spectrogram of sh: Lehiste and Peterson, p. 61, spectrogram of shag),
while spirants have much lower random energy spread throughout the
spectrum. Time-amplitude plots of sh also reveal that sibilants have
energy at higher frequencies than spirants do (Figure 6). (According to
Katherine S. Harris, 1963, sibilants also differ from spirants in that
they can be identified in listener tests without the help of transitional
cues from the adjacent vowels, while spirants cannot be identified
without transitional cues.)
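
One crude way for a machine to read "energy at higher frequencies" off
a time-amplitude record is to count how often the wave crosses zero.
Zero-crossing counting is our own stand-in here, not a measurement the
report describes, and the sketch uses pure tones at assumed frequencies
in place of real sibilant and spirant noise. The names
zero_crossing_rate and tone are hypothetical.

```python
import math


def zero_crossing_rate(samples, sample_rate_hz):
    """Zero crossings per second: a rough cue to how closely peaks are spaced."""
    crossings = sum(1 for a, b in zip(samples, samples[1:])
                    if (a < 0) != (b < 0))
    return crossings * sample_rate_hz / len(samples)


def tone(frequency_hz, sample_rate_hz, n_samples, phase=0.3):
    """Synthetic stand-in signal; real fricative noise would be random."""
    return [math.sin(2 * math.pi * frequency_hz * n / sample_rate_hz + phase)
            for n in range(n_samples)]


SAMPLE_RATE_HZ = 16000   # assumed sampling rate
high = zero_crossing_rate(tone(3000, SAMPLE_RATE_HZ, 1600), SAMPLE_RATE_HZ)
low = zero_crossing_rate(tone(500, SAMPLE_RATE_HZ, 1600), SAMPLE_RATE_HZ)
print(high > low)
```

A signal dominated by higher frequencies yields the higher rate; whether
this separates sibilants from spirants in real speech would have to be
checked against recorded data.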

(5)  Affricates  :  Physically  affricates  are  produced  by  releasing 
the  energy  of  a  stop  burst  into  the  energy  of  a  following  sibilant.  As  the 
mouth  is  in  almost  constant  motion  we  omit  a  physical  diagram  of  its 
articulation;  the  reader  can  discover  how  affricates  are  formed  by 
noting  the  similarity  between  white  shoes  and  why  choose. 

In measuring affricates mechanically, traditional phonetics
has assumed that the energy of the stop burst appears just at the
beginning of the following sibilant. We have acoustic evidence,
however, that the major release of the stop energy occurs not at
the beginning of the sibilant, but in the middle of it. This can be
seen in the time-amplitude plot (Figure 7). Information from the
time-amplitude plot is more significant for our model than information from
the spectrogram in characterizing affricates, because spectrograms
are less able to record variations in sound-wave intensity.

(6) Laterals: In the articulation of a lateral, such as l, the
middle of the tongue is in firm contact with the teeth or the roof of the
mouth, and the air stream escapes at the side of the tongue. It has
proved very difficult to give a clear acoustic description of l; researchers
at Haskins Laboratories (O'Connor, Gerstman, Liberman, Delattre,
and Cooper, 1957) also report difficulties in synthesizing it.

2. Place of Articulation

The place of articulation is that part of the buccal (mouth)
cavity which has the smallest cross-section during the articulation of
a specific sound. It is determined by the position of the articulators
(the tongue or the lips) rather than how they are used in a given position.
In this study we will consider six places of articulation; these are listed
in the following chart with examples of the phones which fall into these
subdivisions. The heading "Usual Spelling" means usual English spelling.

Place of           Usual                 Phonetic    Word Containing
Articulation       Spelling              Symbol      the Example

(1) Guttural       k, c ("hard")         k           came
                   g ("hard")            g           game
(2) Palatal        sh                    ʃ           shin
(3) Alveolar       t, d                  t, d        tin, din
(4) Dental         th, th                θ, ð        thin*, then*
(5) Labiodental    f, v                  f, v        fear, veer
(6) Labial         p, m                  p, m        peer, mere

* It should be noted that thin and then do not have the same initial
  sounds although they start with the same letters; also both words
  have only one consonant before the vowel, although two letters
  are needed to spell the consonant.



(1) The Guttural Phones: Guttural phones (i.e. speech sounds)
have the smallest cross-section in the back part of the mouth, as in the
X-ray of k (Fant, p. 186, Figure 2.8-6). The constriction is achieved
by raising the back part of the tongue.

(2) The Palatal Phones: The palatals have the smallest
cross-section in the mid-part of the buccal cavity. This is achieved by
raising the middle part of the tongue towards the hard palate (the middle
part of the roof of the mouth).

(3) The Alveolar Phones: The alveolar phones have the
narrowest cross-section at the alveolar ridge, just in back of the upper
front teeth. The constriction is achieved by bringing the tip
of the tongue to the alveolar ridge.

(4) The Dental Phones: The dental phones have the narrowest
cross-section at the upper front teeth. There are actually two places
of articulation involved here; the tip of the tongue may be brought
either to the back of the teeth, as in the X-ray of t (Fant, p. 186,
Figure 2.8-6), or to their biting edges.

(5) The Labiodental Phones: The labiodental phones have the
greatest constriction between the upper teeth and the lower lip; see
the illustrations for [f] under the spirants.

(6) The Labial Phones: The labial phones have the greatest
constriction between the two lips; see the illustrations for [p] under
the stops.

(7) Variability of Place of Articulation: Some English phonemes
show much more variation in the place of articulation than others do.
For most speakers, the place of articulation of all allophones of /p/
is approximately the same, but there are three places of articulation
for the allophones of /l/ in leave, lilt, and milk, and the differences
in pronunciation are audible. /p/ has a narrow range because there are
two other voiceless stop phonemes in English, and the existence of
these others limits the variability of /p/. Since /l/ is the only lateral
phoneme in English, the sole limits on its variability are physiological;
all parts of the tongue do not lend themselves equally well to lateral
articulation. The extreme variability of /l/ probably explains why
researchers have not succeeded in discovering its acoustic characteristics.

The sibilants are another group of sounds which show variation
in place of articulation, but the situation here is somewhat different.
The sibilants are the only phone classes whose characteristic sounds
do not originate at the point of constriction. We have already
discussed the fact that the hissing sound of the sibilants arises when air
which has been channeled through a narrow groove hits the teeth. The
narrow groove is formed by the tongue, and the difference between [s]
and [ʃ] (as in see and she) lies in the size of the groove: [s] has a
smaller groove which is usually made with the tip of the tongue, and
[ʃ] is made with the part of the tongue just behind the tip. The
difference between [s] and [ʃ] can be analyzed either as a difference
in manner ([ʃ] is articulated with a wider groove), or as a difference
in place ([ʃ] is articulated further back). At present we choose to
consider it a difference in place, but we may change our analysis later.

When we come to the rules of euphonic combination, the
different degrees of variability will prove important.

3.  Resonances  and  Aspiration 


There are three types of resonance -- voiceless (only the
friction of breath within the mouth); voiced (using vibration of the
vocal flaps for modulating air that flows through the buccal cavity)
and voiced plus nasal (coupling the nasal cavities with the previous
system). In this subsection we also include aspiration because it
occurs in English only with voiceless stops. As these resonances modify
almost every sound that the mouth can physically produce, their
identification in a separate dimension is important to our model.

(1) Aspiration. An aspirated phone is one which has an audible rush
of air after the phone itself is articulated. It should be noted that the
burst of sound when a stop is released is not aspiration; it is a necessary
part of the articulation of the stop.

As indicated above, the contrast between aspirated and unaspirated
sound is included with the resonances because in English only certain
stops are aspirated. Stops which are voiceless (such as [p]) are also
aspirated; stops which are voiced (such as [b]) are unaspirated. The
only exception to this is that [p], [t], and [k] are not aspirated after [s].
This means that the p in pin is not like the p in spin.
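
The English aspiration rule just stated is simple enough to write down
directly. The sketch below is only an illustration of that rule; the
function name is_aspirated and the use of plain letters for phones are
hypothetical conventions, not notation from this report.

```python
VOICELESS_STOPS = {"p", "t", "k"}


def is_aspirated(phone, previous=None):
    """Apply the rule stated in the text to a single English stop.

    Voiceless stops are aspirated unless they follow [s];
    voiced stops (and every non-stop) are treated as unaspirated.
    """
    return phone in VOICELESS_STOPS and previous != "s"


print(is_aspirated("p"))                 # p of "pin"
print(is_aspirated("p", previous="s"))   # p of "spin"
print(is_aspirated("b"))                 # b of "bill"
```

Because aspiration is fully predictable from voicing and the preceding
phone, it need not be stored as a separate choice for English stops.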




There is one aspect of aspiration which requires particular
study. The terms "aspirated" and "unaspirated" are usually applied
only to stops, but the Sanskrit grammarians reported that their
language had aspirated and unaspirated affricates (e.g. aspirated
and unaspirated [j]). The articulatory mechanism and the acoustic
characteristics of these two types of affricates are not known and may
merit investigation.

(2) Voiced-Voiceless. The terms "voiced" and "voiceless" refer to
the action of the vocal cords. When a phone is voiced, the vocal cords
touch each other and vibrate; when it is voiceless, the vocal cords
remain still; they are spread far enough apart that only a slight friction
is created by the air rushing through. The following list gives a few
examples of pairs of phones which are identical except that one is voiced
while the other is voiceless.


              Voiced                              Voiceless
Usual      Phonetic   Word Containing   Usual      Phonetic   Word Containing
Spelling   Symbol     Phone             Spelling   Symbol     Phone

th         ð          then              th         θ          thin
v          v          veer              f          f          fear
z          z          use (verb)        s          s          use (noun)
j          ǰ          jeer              ch         č          cheer

The following list shows the contrasts of voiceless-aspirated with
voiced unaspirated. The third major column gives examples of
voiceless unaspirated stops.

       Column 1                   Column 2                   Column 3
  Voiced unaspirated        Voiceless-aspirated       Voiceless-unaspirated
Spelling Symbol Word      Spelling Symbol Word      Spelling Symbol Word

g        g      gale      k        k'     kale      k        k      skate
d        d      dale      t        t'     tale      t        t      stale
b        b      bill      p        p'     pill      p        p      spill



It should be noted that the phones of Column 3 do not contrast
(are not used to differentiate words) with the phones of Columns 1 and 2
in the same way that the phones of Column 1 contrast with the phones
of Column 2. The phones of Column 3 can occur only after s (at least
in English). The phones in Columns 1 and 2 can occur everywhere
except after s. The phones of Column 3 have one characteristic of
Column 1 and one of Column 2. Phonetically, the p of spill is a
compromise between the p of pill and the b of bill.

(3) The Voiced Nasal Resonances. Voiced nasal resonances
occur when nasal passages are open and the vocal cords are vibrating.
The nasal cavity sets up resonances which differ from the oral
resonances in two major respects. The first formant is weaker for nasal
resonance, and an extra formant, frequently referred to as a "nasal
formant", is present between the first and second formants. ("First
and second formants" is used here to mean the first and second oral
formants.)


(4) Unresolved Classifications. There are some English
consonants which have not yet been fitted into the model. These
include [h], [r], [y], [w], and vocalic [m], [n], and [l].
The factors influencing classification of these problem segments will
be discussed in Section 4 of this report, which presents the acoustic data
studied on this project. The reader is referred to that section for a
resolution of those classifications.

II. VOWELS


This dimension includes all of the phonemically distinct vowels
of American English. It also includes all the nasalized vowels and the
r-colored vowels; these modified vowels are treated as separate units
because their acoustic characteristics are so distinct that it might
prove difficult to have the machine recognize them as ordinary vowels.

Nasalized vowels are quite common before nasal consonants
in American English. As a matter of fact, it is unusual for an
American to say the word can with a vowel which is not nasalized.
The difference between a nasalized and purely oral (non-nasalized)
vowel is sometimes word-differentiating. In rapid speech, final
[t]'s and [n]'s sometimes are omitted, and when this happens the
only difference between cat and can (with the meaning "be able")
is that cat has a purely oral vowel, while can has a nasalized one.

The r-colored vowels are also quite common in American
English. They occur before r. After vowels r is sometimes omitted,
and when this happens, the r-coloring may be word-differentiating.




We  have  two  reasons  for  treating  the  vowels  as  a  separate 
dimension. 

First,  there  is  a  major  problem  in  segmenting  a  vowel  from 
its  adjacent  consonant.  Second,  there  is  evidence  that  the  articulation 
(and  hence  the  acoustic  characteristics)  of  a  given  consonant  depend 
in  part  on  what  vowel  follows. 

The t of tea, for example, is not identical with the t of top.
Researchers have found they could not divide a vowel from surrounding
consonants without some overlap of the two sounds. For this reason,
we must specifically study the combinations of vowels and initial
consonants.

A.  SEGMENTATION 

Until fairly recently, phoneticians did not realize that segmentation
was a problem. They worked with very few instruments, and their most
important technique was to articulate the sounds they were interested
in and observe their own articulatory processes closely. One difficulty
with this approach was that when they articulated a sound they were
interested in, they sustained the sound far longer than any normal
speaker would; hence they gained the impression that the transition
from one sound to another is only a tiny fraction as long as its steady
state. When X-ray motion pictures were made, however, everyone
who looked at them realized that steady states were only a small
part of the total duration of an utterance.

At the time of the first X-ray movies it was still possible to
say that the steady states, although brief, were the essential part of
the speech wave, and the transitions had no function in the perception
of speech; later experiments with tape recorders have shown that this
is not the case either. The procedure in these experiments was to
erase part of the recording and then note the listener's response to the
remainder.

Martin Joos described (Joos 1948, p. 121) an experiment
with the syllable tel (from the word hotel). He first cut off the stop
gap, noise burst, and aspiration of t as well as the first 20 or 30
milliseconds of the voiced portion which Joos considered part of the
vowel. When he played this tape to a group of listeners, the largest
number said they heard tell, a smaller number said dell, and a few
said sell, ell, or hell. Since only a very small number said ell,
this means there are some traces of a consonant in the voiced portion.
More specifically there are traces of a consonant articulated with the tip
of the tongue, since all the consonants the listeners mentioned are
tongue-tip consonants except h.




The results of this experiment were quite striking because t
is phonetically described as a voiceless aspirated alveolar stop,
but the listeners said they heard t in a speech fragment which was
voiced, had no stop gap, stop burst, or aspiration, and which had
formants like a vowel. This conclusively proved that there are clues
about preceding consonants in the actual vowel portions themselves.

The source of these clues can be discovered by studying
spectrographic analyses of given syllables within a word. These
analyses show that when a vowel is preceded by a consonant, the
formants of the vowel (the concentrations of energy at certain
frequencies) show a fairly rapid change in frequency at the beginning of
the vowel. Such changes are called transitions. When these frequency
changes cease, the vowel has reached a steady state. Research
indicates that these transitions vary according to what consonant
precedes the vowel. This gives listeners the chance to identify given
consonants within actual vowel sounds.

Further research has indicated that the first-formant (lowest
frequency) transitions give information about the dimension of
resonances useful to our model while the second-formant (second lowest
frequency) transitions give information about place of articulation. In
all the experiments described below, the experimental material did
not contain any consonant clues except transitions and there was no
noise burst to indicate release of a stop, yet listeners were able to
identify, from transitional vowel formants, the presence of a specific
consonant.

Among the studies of first-formant transitions of which we are
aware are those which have been made at Haskins Laboratories with
the Pattern Playback speech synthesizer, which reproduces speech
from artificially produced spectrograms. Researchers report that in
several experiments in synthesizing speech, when the first formant is
kept level, the most "natural" voiced stops (b, d, and g) were produced
when the formant was at its lowest frequency (Delattre, Liberman,
and Cooper, 1951). Later experiments (Liberman, Delattre, and Cooper,
1958) showed that when the starting point of the first formant was raised
and the start of the formant delayed, listeners reported hearing
voiceless stops (p, t, and k). The authors carried out more experiments to
separate these two variables and concluded that a rising first formant
is a cue to voiced stops, and a time delay in the first formant without
rising transition is a cue for voiceless stops. C. G. M. Fant is also
studying this aspect of synthetic speech.




Experimenting with second-formant transitions, scientists
at Haskins have introduced the concept of the locus, which is the
frequency level from which the second-formant transitions of a given
consonant are presumed to begin. In synthesizing speech, they report
better results if the second-formant transition does not begin at the
locus, but simply "points" to it (Delattre, Liberman, and Cooper, 1955).
They concluded that the best g is produced with a locus at 3000 cycles,
the best d at 1800 cycles, and the best b at 720 cycles.

Since the locus of a stop is fixed and the frequency of the second
formant of a vowel depends on what particular vowel it is, it follows
that the transition of a consonant may be rising before some vowels,
falling before other vowels and level for one vowel. Thus the transition
for [di] is rising, for [dɛ] it is level, for [du] it is falling sharply.
For [bi] the transition is sharply rising, and for [bu] it is slightly
rising.
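The relation between a fixed locus and a vowel-dependent second formant
can be sketched as a small calculation. The sketch below is only an
illustration: the locus values used (720, 1800, and 3000 cycles for b, d,
and g) follow the Haskins figures, but the vowel second-formant values,
the tolerance for calling a transition "level", and all names are
assumptions introduced here, not measurements from the studies cited.

```python
# Second-formant loci in cycles per second (Haskins figures for b, d, g).
LOCUS_HZ = {"b": 720, "d": 1800, "g": 3000}

# Hypothetical second-formant frequencies of three vowels, for illustration only.
VOWEL_F2_HZ = {"i": 2300, "E": 1800, "u": 900}


def transition_direction(consonant, vowel_f2_hz, level_tolerance_hz=100):
    """Return the direction of the F2 transition from the locus toward the vowel.

    level_tolerance_hz is an assumed margin within which the transition
    is treated as level.
    """
    difference = vowel_f2_hz - LOCUS_HZ[consonant]
    if abs(difference) <= level_tolerance_hz:
        return "level"
    return "rising" if difference > 0 else "falling"


if __name__ == "__main__":
    for consonant in ("d", "b"):
        for vowel, f2 in VOWEL_F2_HZ.items():
            print(consonant + vowel, transition_direction(consonant, f2))
```

With these assumed vowel values the sketch gives rising, level, and
falling transitions for d before the three vowels, and rising transitions
for b, in line with the pattern described above.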


All these transitions which give information about consonants
occur in what is traditionally considered the vowel portion of the speech
stream. Joos's tape-erasing experiment, moreover, showed that both
the consonant and the vowel were perceived throughout almost the entire
stretch of what is usually considered the vowel portion. As it is
virtually impossible to separate a consonant from a following vowel, vowels
have been included as a separate dimension in the model to avoid
segmenting between them.

NON-DISTINCTIVE CONSONANT DIFFERENCE DEPENDING
ON FOLLOWING VOWEL.

Very little work has been done on this aspect of consonant-vowel
combinations, but the available data indicate that frequently a
consonant is influenced by the following vowel. Liberman, Delattre,
and Cooper report (1957) that the judgements of synthetic stop bursts
as p, t, and k depended on the frequency position of the burst in relation
to the vowel; this was especially true of p and k. Bursts at high frequency
were reported as t. Bursts at lower frequencies were reported as k
when they were on a level with, or slightly above, the second formant of
the vowel; otherwise they were reported as p. These data were used by
Denes and Fry in the design and construction of their phonetic
typewriter.
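The burst judgements just described amount to a simple decision rule,
which can be sketched as follows. This is a rough illustration only: the
source gives no numerical boundaries, so the cutoff for a "high" burst
and the margins for "on a level with, or slightly above" the second
formant are hypothetical parameters, and the function name is introduced
here for convenience.

```python
def classify_burst(burst_hz, vowel_f2_hz,
                   high_burst_hz=3000.0,
                   level_tolerance_hz=100.0,
                   above_f2_margin_hz=300.0):
    """Label a synthetic stop burst as 'p', 't', or 'k'.

    All three threshold parameters are assumed values for illustration.
    """
    # Bursts at high frequency were reported as t.
    if burst_hz >= high_burst_hz:
        return "t"
    # Lower bursts level with, or slightly above, the vowel's second
    # formant were reported as k.
    offset = burst_hz - vowel_f2_hz
    if -level_tolerance_hz <= offset <= above_f2_margin_hz:
        return "k"
    # All other lower bursts were reported as p.
    return "p"


if __name__ == "__main__":
    print(classify_burst(3600, 2300))  # high burst
    print(classify_burst(1900, 1800))  # just above the second formant
    print(classify_burst(700, 1800))   # well below the second formant
```

The same burst frequency can therefore receive different labels before
different vowels, which is the dependence on the following vowel noted
above.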


One aspect of the influence of vowels on preceding consonants
which requires further investigation is labialization of a consonant
before a rounded vowel such as [u]. In such words as sue and too,
many speakers have their lips rounded during most of the consonant
articulation; the acoustic characteristics of such labialization are
not known.


III.  DURATION 

In  approaching  the  problem  of  the  duration  of  units  of  speech 
it  is  best  to  emphasize  our  most  basic  task:  in  this  study  we  are 
trying to break normal speech into patterns that a machine can
recognize. A human ear can adjust to variations in regional accents, even
to mistakes in pronunciation, without difficulty. A machine must be
programmed  to  isolate  each  sound  from  surrounding  sounds  and  to 
identify  it.  If  two  people  pronounce  a  word  differently,  the  human 
ear  makes  automatic  adjustments;  the  transcribing  machine,  however, 
must  be  built  to  compensate  for  such  differences  by  breaking  the  word 
into  its  component  sounds. 

To accomplish this difficult task, it is apparent that the
classification of each sound within a word must be extremely precise. Past
studies of phonetics have tended to concentrate on distinctions which
the human ear can identify and which are useful in distinguishing
whole words from each other. Often sounds were measured only by
ear. These studies are invaluable to machine linguistics because
they begin to identify problems which we now must solve. On the
other hand, in measuring the duration of sounds of words they seldom
needed to break every sound into all its component parts. For our
purpose they are not definitive.

It is the problem of dividing each sound into units basic enough
to be recognized by a machine that we must now examine in detail.
In this problem we face the question of separating each sound in a
word from every other sound; when applicable, breaking each
individual sound into its beginning (on-glide), middle (steady-state)
and conclusion (off-glide); and finally measuring the duration of the
sound at each of its stages. (See Figure 8.)

The methods of duration measurement discussed in the
following sections assume the availability of a reliable formant
tracker. At present such a formant tracker does not exist, but
available information requires our assumption that it can be developed.
In case it seems extremely difficult or impossible to build a reliable
formant tracker, the following discussion would still be necessary to
relate information about how words are pronounced to our common
knowledge about how speech is heard. If it is impossible for our
machine to identify spoken language by immediate comprehension of
formant variations, moreover, it may still be possible to recognize
a word by measuring differences in the duration of its component
sounds by methods that could be considered later.

[Figure 9: Spectrogram of Southern, English, and General American
Speech]

Assuming the existence of a formant tracker, this form of
measurement would be one way to distinguish between British and
Southern accents, for example. The Englishman tends to bite off
his words; his on-glide and off-glide are rapid, while his
steady-state is comparatively long. The Southerner may take about the same
amount of time to make a sound, but he drawls; his on-glide and
off-glide are gradual, his steady-state is extremely brief. (See Figure 9.)
Measuring speech in specified time-units can help to identify these
differences and compensate for the quantitative information about
formant levels that is expected according to normal standards of the
transcribing machine. Phonetic differences in regional speech must
be indicated by other criteria, of course.
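The contrast between a clipped and a drawled delivery can be expressed
as the share of a sound's total duration taken up by its glides. The
following sketch assumes that a formant tracker has already supplied the
three durations; the class name, the example durations, and the
one-half boundary between "clipped" and "drawled" are hypothetical and
serve only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class SoundDurations:
    """Durations, in seconds, of the three stages of one sound."""
    on_glide: float
    steady_state: float
    off_glide: float

    @property
    def total(self):
        return self.on_glide + self.steady_state + self.off_glide

    def glide_fraction(self):
        """Share of the total duration spent in the on-glide and off-glide."""
        return (self.on_glide + self.off_glide) / self.total


def describe(durations, drawl_boundary=0.5):
    """Label a sound 'drawled' when glides dominate it (assumed boundary)."""
    return "drawled" if durations.glide_fraction() > drawl_boundary else "clipped"


if __name__ == "__main__":
    # Two sounds of equal total length with differently distributed stages.
    rapid_glides = SoundDurations(on_glide=0.03, steady_state=0.14, off_glide=0.03)
    gradual_glides = SoundDurations(on_glide=0.08, steady_state=0.04, off_glide=0.08)
    print(describe(rapid_glides), describe(gradual_glides))
```

Both example sounds last the same total time, so only the distribution
of that time among the three stages separates them.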

At times duration is the only method of distinguishing between
words, as in "bomb" and "balm". The time durations of on-glides
and off-glides are also determining factors in distinguishing
semivowels from stop consonants.

With the value of duration measurement clearly in mind, we
may turn to the problem of utilizing duration in our model. The
following subsection will first discuss the work of past transcribing
machines and the normalization of duration. Next it will discuss the
duration measurements our own model must make. Finally it will
summarize the value of phoneticians' past work to our project.

A. Some Considerations on the Normalization of Duration

Any discussion of duration measurement would be pointless
without a review of past work which has actually enabled working
machines to recognize preselected spoken words. Both automatic
digit recognizers and automatic word recognizers are examples of
this kind of machine. Although effective within a limited context,
such a machine uses methods which are unworkable for a general
purpose transcribing machine for reasons described below.

Machines such as those mentioned above can actually recognize
given words spoken at different times by different speakers by
standardizing or "normalizing" the duration of each of these words. The
machine is given one pronunciation (and duration) of the word "nine",
for example, as its standard for deciding whether a spoken sound is
also the word "nine," or whether it is some other word. Every man
does not pronounce "nine" with the same time duration. To compensate
for this the machine will proportionately shorten or lengthen the
time duration of the sound it hears to conform with its "normalized"
pattern of duration for the entire word.

Such standardization or normalization of the duration of words
results in proportionate lengthening or shortening of the on-glides, the
steady-state, and the off-glides of voiced portions of speech. It has
similar effects on the duration of silent portions, of noise bursts,
and of periods of aspiration.
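Proportionate lengthening or shortening of this kind is a single
scaling of every portion of the word by the same factor. The sketch
below illustrates that arithmetic only; it is not a description of how
any particular digit recognizer was built, and the function name and the
example durations are assumptions.

```python
def normalize_durations(segment_durations, standard_total):
    """Scale every segment so that the word's total equals standard_total.

    segment_durations: durations in seconds of the successive portions of a
    spoken word (glides, steady-states, silences, bursts, aspiration).
    standard_total: duration in seconds of the stored standard pattern.
    """
    total = sum(segment_durations)
    if total <= 0:
        raise ValueError("the spoken word must have a positive duration")
    scale = standard_total / total
    # Every portion is stretched or compressed by the same factor, so the
    # proportions between the portions are unchanged.
    return [duration * scale for duration in segment_durations]


if __name__ == "__main__":
    spoken = [0.10, 0.30, 0.10]   # hypothetical slow utterance, 0.50 s in all
    print(normalize_durations(spoken, standard_total=0.25))
```

Because one factor is applied throughout, a speaker whose unvoiced
portions are unusually long relative to the rest of the word still shows
that disproportion after normalization.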

The end result of normalization is that the digit recording
machine can recognize the standard pattern of voiced parts of a word
(parts using the vocal flaps, such as "m") as they may be spoken by
several speakers from the same locality and with a similar dialect.

The duration of unvoiced parts of sound (those not using the vocal
flaps, such as "s") presents a different problem, however. The
durations of certain unvoiced sounds such as "sh" can vary considerably
with each individual speaker and each regional dialect.

Given  two  conditions  a  transcribing  machine  may  be  able  to 
recognize  words  by  standardizing  their  duration.  The  first  condition 
is that voiced sounds be a prominent part of the pronunciation. More
important,  the  machine  must  have  a  limited  vocabulary,  such  as  ten 
digits.  A  third  possible  condition  could  be  that  speakers  have  the 
same  regional  accent. 

Today's transcribing machines are able to ignore small
differences in pronunciation precisely because they have limited
vocabularies. To identify a word, they compare the patterns of
pronunciation of whole words rather than of their component sounds.
The words in a machine's vocabulary represent extremely diverse
waveforms. When only a small number of dissimilar patterns need
to be identified, the task is simple. It is made even easier by
the fact that words are spoken separately into digit recording machines
rather than slurred together as they would be in normal speech.

The general purpose speech recorder cannot work under those
limitations. It must have a large vocabulary. With a large vocabulary
it must be able to distinguish between very similar patterns of
pronunciation. It will not be transcribing the slow precise tones of
telephone operators enunciating a long-distance number; rather, it will
be recording language spoken at its normal rapid rate. Finally, when
such similar words as "three" and "through" are included in the
vocabulary, a general transcribing machine cannot be limited by inability
to recognize units smaller than simple word-patterns.


A  general  purpose  recognizer  must  consider  the  sum  of  the 
sounds  as  they  combine  to  form  a  whole  word,  not  simply  the  sound 
of  the  whole  word.  The  following  section  will  thus  consider  the 
importance  of  measuring  the  duration  of  separate  sounds  as  they 
naturally  occur  within  a  word. 

B.  The  Importance  of  Duration  Measurements  to  Speech  Recognition 

In English there are many distinctions between individual sounds
that can be identified accurately only by measuring the duration of a
given phone. Notable examples are summarized below.

1. Some words are identical except for vowel duration (balm, bomb).

2. In English the phones of a stressed syllable are longer than those
of the same syllable when it is not stressed (the by itself vs. the
apple).

3. One difference between voiced and voiceless consonants in English
is that a vowel preceding a voiced consonant is longer than the same
vowel preceding a voiceless consonant (bead, beat).

4. Some pairs of similar consonants have duration differences which
serve to identify them. Rapid is distinguished from rabid because
the stop-gap of [p] is longer than that of [b]. One difference
between English s and z is that s is longer than z.

5. Transitions and semivowels are also differentiated by duration.
A semivowel is longer than a transition (cue, coo).

6. There is at least one Southern dialect in which the transitions
from phone to phone have a very long duration, and the steady-states
are very short. In any dialect where this is the case, it seems
probable that the steady-state does not reach the frequency level
it might have attained if the transition had been more rapid. It
may be necessary to instruct the machine to compensate for this
whenever such a brief steady-state occurs.

7. A non-final nasal is much shorter than a final nasal, and if a
final nasal has the duration of a non-final nasal, listeners will
report hearing a voiceless stop after it.

8. The American English vowels ɪ (pit), ɛ (pet), ʌ (putt), ʊ (put) differ
from i (peat), æ (pal), ɑ (pot), ɔ (bought), u (boot) in that the former
are shorter and they also have a longer offglide relative to the
steady state.




We will next consider how traditional phonetics has analyzed
the speech wave pattern to solve such problems of identification,
how our work differs from most past work, and how our work can
utilize the results of other studies.

For traditional phonetics duration has been important
primarily when it served to distinguish one word or phrase from another.
Such distinctions are phonemic (word-differentiating). In some
American dialects, for instance, the words balm and bomb are
identical save for the duration of their vowels; because it distinguishes
the two words, the difference in duration is phonemic and of greater
practical significance to phoneticians.

The problem of identifying sound by ear has tended to confine
phonetic studies to problems of phonemic duration; this is unfortunate
for our present investigation. Although there is a considerable degree
of difference in length between a very long and a very short phone, for
example, there is likely to be only a minute difference in duration
between two similar phones, as in "buck" and "duck". There are many
such small differences in duration, leading gradually from phones of
very short duration to those of very long duration. Only when length
is phonemic are phones likely to fall easily into "long" and "short"
categories. Phoneticians listening to sounds do not identify them in
terms of precise time units, moreover. Instead, they identify long and
short sounds in terms of how they are heard in their phonetic context.
The vowel sounds in biz (as in show-biz) and beat are probably of the
same duration, but the i in biz seems to be shorter. For our model
most of the finer distinctions are important in analyzing a sound because
the model must measure them.

In addition to the distinction between sounds which are
phonemically long and those which are phonemically short, linguists have
also paid special attention to length variations which serve to
characterize word boundaries. The most commonly cited example for this
is the contrast between the phrases a nice man and an ice man. According
to an analysis which is widely accepted among linguists the phonetic
difference between these two phrases is that the n in an ice man is
longer and more drawn out. Acoustically, this is not true (see the
discussion on nasals) but the whole problem is extremely complex and
many linguists continue to use the old description because no clearcut
new description has been proposed.

Aside from phonemic distinctions which have just been discussed,
the only other functions of duration much discussed by linguists have
been the greater duration of a stressed syllable and the greater
duration of a final syllable. In both cases, however, the duration
variations are also accompanied by pitch and intensity variations.
The problem is thus not purely one of duration, and requires an
exhaustive study not limited to the purposes of this particular
subsection.

We may now turn to the actual work already accomplished
in measuring duration. We have already discussed how to divide
sounds so that a machine can measure them. This problem again
arises when we try to assess the effects of duration variations, since
it is frequently necessary and extremely arduous to decide the specific
point at which a particular phone begins or ends.

Many researchers on duration have used criteria for segmentation
that are different from those we are considering. Hence much of
the work we are now going to discuss in this subsection may have
limited application to our model. We will first review those explorers
whose work, although not directly useful to our study, lights the way
to further research. Next we will consider more recent studies which
break duration measurements into the same units our model plans to use.

One of the early and more comprehensive studies of English
phone durations was published in 1903 by Ernst A. Meyer. Meyer used
a rubber mouthpiece to record the air-pressure variations in his subjects'
breath while they spoke. He measured the sound durations from these
air-pressure records. He also measured the transitions from one sound
to another by mechanically recording the lip movements of the speaker.
Those parts of the speech process which showed rapid lip movement
in one direction he called glides.

Meyer's equipment was quite simple, but it is worth noting
that several of his conclusions have been supported by more recent
research. The experiments of Denes, Lehiste and Peterson, and
Sharf described below confirm Meyer's statements that a vowel before
a tense (voiceless oral) consonant is shorter than before a lax (voiced
oral) consonant. Denes's work also confirms Meyer's observation that
a tense consonant is longer than a lax one.

Nevertheless, several facts about Meyer's work limit its value
for machine linguistics. Meyer had only two informants; both spoke
standard British English. Some of Meyer's results may simply reflect
the idiosyncrasies of his informants. Moreover, statements about
British English as it was spoken sixty years ago do not necessarily hold
true for American English today. Still another drawback is that the
material used for this study consisted of one- and two-syllable words;
this means that it deals primarily with stressed or "accented" syllables.
A fourth drawback is that Meyer used the terms "tense" and "lax"
without ever clearly defining them. It is possible to discover from the
text which phones fall into each category, but the identifying
characteristics of the various categories are never described. (Meyer's
terminology and conclusions are presented in Appendix D.)

Agreeing with previous observations by Meyer, Lehiste and
Peterson (1960) report variations in vowel duration which depend on
the following consonant. They add that the relative durations of onglide,
steady-state, and offglide remain constant. If this is true of Lehiste and
Peterson's data, it is probably also true of Meyer's data. This would
seem to be applicable only when measuring accents of people with similar
dialects, however. Lehiste and Peterson also agree that a vowel before
a voiced consonant (one using the vocal flaps, as in bag) is longer than
before a voiceless one (not using vocal flaps, as in pack) and that
fricatives (s, sh, z, zh, f, v, th, h) lengthen the preceding vowel. Lehiste
and Peterson agree with Meyer that no definite statement can be made
about the effect of an initial consonant on the following vowel.

They disagree with Meyer about the effect of nasals on the
preceding vowel. Meyer says the vowel is shortened while Lehiste
and Peterson say it is lengthened. Whenever there is such disagreement,
Lehiste and Peterson's results may be more interesting to us because
their informants spoke the dialect we are studying.

Investigation by Donald Sharf (1962) suggests the importance of
the relationship between some vowels and the duration of their following
consonants. Sharf recorded word pairs such as catty-caddy,
tacking-tagging, and napping-nabbing, which were identical except for the
voicing or voicelessness of the stop consonant between vowels, or
"intervocalic stop." He reported that he measured the relative duration
of the vowels before the different stops, but he did not say what criteria
he used to make the segmentation between consonant and vowel. Since
there is considerable overlap, this is a serious omission.

Sharf arrived at the following results. The proportionate duration
of vowels before p, or b is 3:4; the duration before k, g is A\b', the
average duration of a vowel before d is.9es longer than before t. Since
this experiment did not manipulate durations, but only calculated them,
we still have no evidence that the length of the preceding vowel affects
the perception of a stop as voiced or voiceless. When Sharf's work
and that of Denes (considered below under sibilants) are compared,
however, this seems possible and worth investigating.

The preceding survey is valuable mainly for the lines of
investigation it suggests, rather than for its direct bearing on our
model. The following studies are of primary technical importance in
defining the different aspects of duration which our model plans to
incorporate.

Researchers have used several different methods to investigate
duration. Perhaps the most valuable series of studies for our report
is that of Haskins Laboratories. Using Pattern Playback, Haskins has
made it possible to produce artificial sounds and alter sounds by
varying one particular detail of the speech wave pattern while keeping
others constant. By such variations it is possible to identify
phonetically significant aspects in the duration of the onglide.

A second method of experimentation is to record sounds on
magnetic tape. Significant work in this field is that of Leigh Lisker,
who was able to change the sound of words by splicing taped sounds
and varying the duration of silence after stop consonants; of P. Denes,
who used similar methods with s and z; and of Richard Harrell, who
played taped sounds backwards to check the duration of nasals. An
additional important method of measuring duration is to analyse sound
spectrographs and compare them; principal workers in this field
are Ilse Lehiste and Eli Fischer-Jorgensen. Specific discussions of
experiments that have supported those conclusions can be found in
Appendices D and E.

The outline given here merely summarizes various aspects
of duration measurements.

1) Duration of Nasals. The relative duration of initial and
final nasals is important in distinguishing between such sounds as
bum and bump. Assuming both syllables receive the same stress,
evidence indicates that final nasals are longer, but there is some
dispute about this.

2) Relative Duration of the Onglide, Steady-State and Offglide
of Liquids and Semi-Vowels. Experiments with the Pattern
Playback indicate that r, y, and w each have onglides, offglides, and
steady-states of proportionately equal duration. The sound of l is most
easily distinguished by listeners when the first formant transition is
very short.

3) Duration of Spirants and Sibilants. Available research by
P. Denes and Ernst Meyer suggests that variations in the duration of
spirants and sibilants may be more important than voicing in
distinguishing between such words as the noun use and the verb use.




4) Duration of Stop Closures. The listener's ability to
distinguish between p and b, as in rapid and rabid, may depend on the
duration of the stop closure between vowels in English according to
tape-recorder experiments made by Leigh Lisker.

5) Duration of Stop-Bursts. The duration of noise bursts
may be the same for all stops in English, but there is little definite
information on the subject to confirm or deny this.

6) Duration Between the Stop Burst and the Beginning of
the First Formant. In this category duration measurement
is particularly helpful in distinguishing between voiced and voiceless
stops at the beginning of a word so that it may be possible, for example,
for a transcribing machine to tell the difference between bah and pah.

7) Duration of Vowel Onglide, Offglide and Steady-State. These
three aspects of duration are closely related to each other, although each
will be measured separately in our model. As indicated in the
introduction, their relationship to each other may be particularly valuable
in helping a transcribing machine to recognize both a Georgia drawl
and a Yankee stammer.

Experiments have shown that when the duration of the onglide
is disproportionately long, a semi-vowel tends to be heard after
certain consonants. Thus bat becomes bwat. This also explains why
Virginians who drawl say gyarden instead of garden. The Southern
onglide and offglide are disproportionately long compared to the
steady-state according to spectrograms made of Southern speech. There is
some indication that a transcribing machine may have to be programmed
to compensate for such variations from standard American speech
patterns.

C. Methods of Indicating Duration in Our Model

The above data, important in themselves, are also the necessary
prelude to the methods of indicating duration we expect at present to
include in our model. Having identified the formants of separate sounds
of a word according to a set criterion, it is then possible to attack the
problems of individual idiosyncrasies and regional accents that still
need to be solved.

In Figure 10 we have taken the word pit and divided its sounds
according to our proposed method for indicating duration. The upper
sketch represents the word as it would appear on a spectrograph.
Divisions marked by dotted lines indicate the portions of the word we
propose to measure as separate units of duration.




Directly below this diagram we have drawn and labelled the
duration units we would use to indicate these separate portions of pit.
Most of the divisions will already be familiar from the previous section.
Divisions one and six represent the brief silence that precedes the
sound of a stop consonant, while divisions two and seven measure the
duration of the unvoiced portions. For convenience each division is
labeled according to its auditory function, with t representing the
amount of time it takes to speak each separate portion of the word.
The offglide t thus represents the measured time of the offglide. The
practical application of this form of measurement is evident when we
consider the difference in regional pronunciations of words such as
tight, which can sound soft, normal, or clipped depending on whether
it is spoken by a Richmond Virginian, a mid-Westerner, or an announcer
for the British Broadcasting Corporation. Figure 11 represents the
comparative duration variations of these three accents.
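The division of pit into duration units can be pictured as a short table
of labelled time values. The sketch below is a hedged illustration, not
a transcription of Figure 10: the text identifies divisions one and six
as silences and divisions two and seven as unvoiced portions, while the
assignment of divisions three to five to the vowel's on-glide,
steady-state, and off-glide, the duration values, and all names are
assumptions made here.

```python
# (division number, auditory function, duration t in seconds)
# Durations are hypothetical; divisions 3-5 are an assumed reading of the figure.
PIT_DIVISIONS = [
    (1, "silence before the stop p", 0.08),
    (2, "unvoiced portion of p", 0.05),
    (3, "vowel on-glide", 0.03),
    (4, "vowel steady-state", 0.06),
    (5, "vowel off-glide", 0.04),
    (6, "silence before the stop t", 0.07),
    (7, "unvoiced portion of t", 0.04),
]


def duration_of(division_number):
    """Return the time value t recorded for one division of the word."""
    for number, _label, seconds in PIT_DIVISIONS:
        if number == division_number:
            return seconds
    raise KeyError(division_number)


def total_duration():
    """Return the duration of the whole word as the sum of its divisions."""
    return sum(seconds for _number, _label, seconds in PIT_DIVISIONS)


def divisions_labelled(keyword):
    """Return the numbers of all divisions whose label contains keyword."""
    return [number for number, label, _seconds in PIT_DIVISIONS if keyword in label]


if __name__ == "__main__":
    print(divisions_labelled("silence"), duration_of(5), total_duration())
```

Keeping each division as a separate time value is what allows two
pronunciations of equal total length to be told apart by where the time
is spent.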

In our model, it should be emphasized, such variations will
be measured in time units. It should also be noted that some extreme
Southern accents make phonetic changes in the vowels of tight which
are not indicated in this simple duration measurement.

Normalization and variations in formant levels for pronouncing
the same word are additional problems that merit attention. Dividing
a sound into its parts, a transcribing machine must still be able to
account for variations in the time it takes to hear the parts.
Spectrographs of Southern speech also indicate that because of their long
glides, speakers do not actually reach the levels of formants normally
associated with each identifiable sound (Figure 9). A transcribing
machine must be able to recognize that such variations may still be
included in its definition of a sound.

We have discussed how different speakers might pronounce the
same word with different durations. Equally important for formant
movements and durations are the combination of speech sounds within
a word and the order in which they occur. This phase of duration will
be considered in our studies of the rules of euphonic combination in
Section i.

intensity 

intensity  of  an  acoustic  wave  is  tIeJ'ined  as  Lh.e  average  rale 
of  flow  of  energy  Ihrougli  a  unit  area  norma]  t.o  llie  direetion  of  wave 
propogalion.  For  waet:  forms  of  speech  »iiv  h  as  iiJu.straled  in 
Figtirc  IZ  the  intensity  iu  front  of  the  spe.aker's  tips  can  be  defined  as: 


$$\text{Intensity} = \frac{k}{t_2 - t_1} \int_{t_1}^{t_2} [a(t)]^2 \, dt \quad \text{ergs per second per square centimeter} \qquad (2.1)$$

$$= \frac{10^{-7}\, k}{t_2 - t_1} \int_{t_1}^{t_2} [a(t)]^2 \, dt \quad \text{watts per square centimeter} \qquad (2.2)$$

Where:

t1 = time at the start of the interval of time during which
the average acoustic energy in the waveform of speech
is to be measured.

t2 = time at the end of the above mentioned time interval.

a(t) = the amplitude of the sound pressure (or that of
volume velocity or that of particle velocity) of the
speech wave in front of the speaker's lips as a function
of time. a(t) is usually expressed in dynes per square
centimeter.

k = constant determined by the physical characteristics of
the medium in which the speech waves propagate. For
air at a pressure of 781 millimeters of mercury and
at 20 degrees C, 1/k = 41.1 dyne-seconds per cubic
centimeter.

Intensity as defined in equations (2.1) and (2.2) can also be
considered as the energy contained in a column p centimeters long that
would pass through the unit area in (t2 - t1) seconds, as:

$$\text{Intensity} = \frac{[\text{Energy in a column } p \text{ centimeters long}]}{p} \, c \quad \text{ergs per second per square centimeter} \qquad (2.3)$$

where: p / c = (t2 - t1) seconds, and c = velocity of sound in the
medium in centimeters per second.

If an acoustic wave were to remain unchanged in amplitude
characteristics, and if its peak amplitudes were to attain a
constant level, over time intervals longer than a second, then its intensity
over a one second interval would be the energy in the wave during that
second.



For  waveforms  of  normal  conversational  speech,  the  uniformity 
of  time  amplitude  characteristics  cannot  be  retained  for  periods  of 
time  as  long  as  a  second;  nor  can  its  peak  amplitudes  attain  a  constant 
level  during  such  a  long  time  interval,  as  illustrated  by  Figure  12. 
However, equations (2.1) and (2.3) can still be used for identifying
intensity  over  a  smaller  interval  of  time  than  a  second. 
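
As a minimal sketch of how equations (2.1) and (2.2) might be evaluated
over a short interval of a sampled waveform, the following Python
fragment replaces the integral by a sum over uniformly spaced samples.
The function names, the uniform sampling rate, and the rectangle rule
are assumptions made for illustration; they are not part of the model
described in this report.

```python
def intensity_ergs(samples, sample_rate_hz, k):
    """Approximate equation (2.1): k / (t2 - t1) * integral of a(t)^2 dt.

    samples        -- sound pressure values a(t), dynes per square centimeter
    sample_rate_hz -- assumed uniform sampling rate (hypothetical parameter)
    k              -- constant of the medium (the text gives 1/k = 41.1 for air)
    """
    if not samples:
        raise ValueError("need at least one sample")
    dt = 1.0 / sample_rate_hz
    duration = len(samples) * dt                  # t2 - t1
    integral = sum(a * a for a in samples) * dt   # rectangle-rule integral
    return k * integral / duration                # ergs per second per cm^2


def intensity_watts(samples, sample_rate_hz, k):
    """Equation (2.2): the same quantity in watts per square centimeter."""
    return 1e-7 * intensity_ergs(samples, sample_rate_hz, k)
```

For a wave of constant pressure amplitude the result reduces to k times
the squared amplitude, whatever the length of the interval.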

When such an interval of time becomes extremely small (i.e.,
in the limiting case when t2 - t1 approaches 0) the equations referred
to above indicate instantaneous power in the wave. Such power, however,
varies very rapidly with time within any period of fundamental vocal cord
frequency -- as is obvious in regions A1, A2, A3, A4 in Figure 12.
However, instantaneous power may be a suitable representation of
intensity in regions B1, B2, B3, in Equation 2.1 wherein the amplitude
is relatively constant. In such a case, though, one could also
consider measuring energy over the time interval from B1 to B2 and
obtaining a value for intensity from equation (2.3).

A similar approach can be considered for the voiced portions of
speech,  as  in  region  A  of  Figure  12.  For  such  a  measurement  of 
intensity,  the  period  over  which  energy  is  to  be  averaged  is  the  same 
as  that  of  the  fundamental  voicing  frequency. 


For accuracy in such measurement of intensity during the
voiced portions of speech, it is essential to measure the fundamental
voicing frequency quite accurately. Moreover, the integration of
[a(t)]^2 would remain unchanged over different voicing periods if the
amplitudes of their peaks were constant and if the time-amplitude
characteristics of the speech wave remained unchanged. Such a
situation can arise only when the spectral density of waves is relatively
stable, when the voicing pulses are of constant amplitude and when
the integration is performed over a complete voicing period.
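
The dependence on a complete voicing period can be illustrated with a
short sketch. It builds a synthetic periodic wave (an assumption; it is
not speech data) and compares the mean of [a(t)]^2 over windows of
exactly one period with the same mean over shorter windows. The period
of 80 samples, the two harmonics, and the window positions are all
hypothetical values chosen only for illustration.

```python
import math

PERIOD = 80  # samples per voicing period (assumed value)


def mean_square(samples):
    """Average of a(t)^2 over the given samples."""
    return sum(a * a for a in samples) / len(samples)


def windowed_mean_squares(signal, window, starts):
    """Mean square of the signal over windows beginning at each start index."""
    return [mean_square(signal[s:s + window]) for s in starts]


# Synthetic periodic wave with constant peak amplitude and stable spectrum.
signal = [
    math.sin(2 * math.pi * n / PERIOD)
    + 0.5 * math.sin(4 * math.pi * n / PERIOD + 0.3)
    for n in range(800)
]

starts = range(0, 400, 7)
full = windowed_mean_squares(signal, PERIOD, starts)   # one complete period
partial = windowed_mean_squares(signal, 50, starts)    # incomplete period

spread_full = max(full) - min(full)
spread_partial = max(partial) - min(partial)
```

With a complete period the value does not depend on where the window
begins; with an incomplete period it fluctuates from window to window.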


When the three conditions just mentioned are not met, the value
of the integral of [a(t)]^2 is known to vary, as readily observable in
the fluctuation of the needle on the VU meters used for measuring speech
for recording purposes or for broadcasting purposes. That such a
fluctuation of the VU meter indication is due to variation of the loudness
of the speech is commonly recognized. The three conditions mentioned
above present additional reasons for such fluctuation, as illustrated by
considering the calculations of intensity of speech in regions A, B, C,
D, in Figure 12.


Moreover, the level of intensity indicated by the VU meter or
sound intensity meter may not necessarily reflect the effort required
for generation of a selected portion of the speech wave. For example,
a  loud  enunciation  will  rarely  bring  the  amplitude  levels  of  unvoiced 
portions  of  speech  to  that  of  the  vowel  sound  of  less  loud  speech  sounds. 
Such  a  difference  in  the  level  of  intensity  arises  from  the  mode  of 
articulation  and  from  resonances  used. 

For  reasons  mentioned  above,  it  seems  advisable  to  consider 
alternate  methods  for  indication  of  intensity.  The  methods  we  are 
considering take notice of the following characteristics of speech waves.

1) Speech waves consist of unvoiced portions of relatively low
amplitude and of voiced portions of relatively high amplitude.

2) The amplitudes of unvoiced portions of speech are characterized,
in broad terms, by the manner of their articulation, e.g., the
waveforms of s have more amplitude than those of f.

3) The amplitudes of voiced portions of speech also vary according to
articulatory information, e.g., the amplitudes of nasals are
usually smaller than those of the vowel sounds.

4) Following the articulation of consonants, the amplitudes of the
succeeding vowels tend to build up to a peak level and then
these amplitudes decay before the end of the vowel enunciation.

5) The rate of build up of vowel amplitudes seems to be usually
more rapid than their rate of decay.

6) Most unvoiced consonant sounds do not show such time-amplitude
characteristics - notable exceptions being the waveforms of ch
that show several regions of increased amplitude.

7) The variation of the amplitudes of waveforms of vowels merits
investigation for establishing:

a) their relation to the transients on spectrograms

b) their acoustic correlates with emphasis or accent or with
linguistic stress

c) their ability to identify stress on the consonants that precede
or ones that follow the vowel sounds with these variations

While information regarding the above aspects can be obtained
from the "mingograms" of speech published by C. G. M. Fant,
additional study of the time-amplitude plots that can be made with
instruments having a wider frequency response seems called for.




[Figure. Box labels: 1. Intensity; 2. Fundamental frequency; 3. Rate of
change of: A. Fundamental frequency, B. Intensity. Intensity
measurements at indicated times for items in the boxes to the left.
1. Maximum intensity; 2. Maximum rate of change of: A. Intensity,
B. Fundamental frequency.]


For establishing a measure of intensity three methods are
under  consideration: 

1)  Study  of  the  spectral  density  distribution  at  various  time  intervals. 

2)  Measurement  of  amplitudes  of  the  voicing  pulses. 

3)  Measurement  of 

i)  radiated  voicing  pulses 

ii)  envelopes  of  radiated  unvoiced  portions  of  speech 

Without discussing in detail the relative merits of the above
methods, method (3) seems to be the easiest to implement. However,
the validity of data obtained by any of these measurements, in light
of the special characteristics of the speech mentioned before, requires
additional investigation.

Assuming that such a measure were developed, and its
measurements related to the components of enunciation, as discussed
under duration, the representation of speech in phonetic symbols
would be as illustrated in Figure 13.

The objective of such a representation is the retention of
information about stress and intonation of speech, as it may be
important to interpretation of the meaning of the words recognized.

V  FUNDAMENTAL  FREQUENCY 

Fundamental frequency is the number of times the vocal flaps
open and close in a second. Since this number varies from time to
time, this frequency can be defined as the reciprocal of the time
interval between successive voicing pulses. This is included as a separate
dimension particularly because it is helpful in recognition of male
and female voices, in understanding of differences in formant levels
of these voices, in noting differences in pitch (which distinguish
questions from statements), and in recording stressed, or "accented,"
and unstressed syllables.
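
The definition above can be sketched in a few lines of Python. The
function name and the representation of voicing pulses as a list of
onset times in seconds are assumptions made for illustration; how the
pulses are detected is outside the scope of this sketch.

```python
def fundamental_frequencies(pulse_times):
    """Reciprocal of the interval between successive voicing pulses.

    pulse_times -- pulse onset times in seconds, strictly increasing
    returns     -- one frequency value (cycles per second) per interval
    """
    frequencies = []
    for earlier, later in zip(pulse_times, pulse_times[1:]):
        interval = later - earlier
        if interval <= 0:
            raise ValueError("pulse times must be strictly increasing")
        frequencies.append(1.0 / interval)
    return frequencies
```

Because one value is produced per pair of pulses, the output follows the
pulse-to-pulse variation of the fundamental rather than a one-second
average.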

A. The Correlation Between Fundamental Frequency Levels and
Formant Levels

Not all speakers have the same formant levels, and this must
be considered in constructing a model that can recognize all varieties
of pronunciation. A woman's speech formants, on the average, are
about ten percent higher than a man's, primarily because her vocal
tract (the area from above the vocal cords to the lips) is shorter, and
also because her vocal flaps are shorter and produce higher fundamental
frequency.




By identifying a range of high fundamental frequency and
correlating  it  with  a  high  formant  level,  a  speech  recognizer  could 
compensate  for  deviation  from  standard  male  formant  levels. 

B.  High  Fundamental  Frequency  and  the  Accuracy  of  Formant 
Measurements 


Because a woman's voice has about twice the fundamental
frequency of a man's it has only half as many harmonics of the
fundamental voicing frequency within a given formant band. This
lack of harmonics makes any measurement of formant frequency
levels less precise; our model recognizes this situation.
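
The arithmetic behind this statement can be shown with a brief sketch.
The band limits and the two fundamental frequencies used in the example
are hypothetical values chosen for illustration; they are not
measurements from this study.

```python
def harmonics_in_band(f0_hz, low_hz, high_hz):
    """Harmonics n * f0 (n = 1, 2, ...) lying inside [low_hz, high_hz]."""
    if f0_hz <= 0:
        raise ValueError("fundamental frequency must be positive")
    harmonics = []
    n = 1
    while n * f0_hz <= high_hz:
        if n * f0_hz >= low_hz:
            harmonics.append(n * f0_hz)
        n += 1
    return harmonics


# Hypothetical formant band of 450 to 950 cycles per second.
lower_voice = harmonics_in_band(120, 450, 950)   # [480, 600, 720, 840]
higher_voice = harmonics_in_band(240, 450, 950)  # [480, 720]
```

With the fundamental doubled, the same band is sampled by half as many
harmonics, so the peak of the formant is located less precisely.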

C. Pitch and Fundamental Frequency

"Pitch" is a term which refers to a certain aspect of sound
perception. The exact relationship between pitch and fundamental
frequency has not been clearly defined. We do know that pitch is
closely related to fundamental frequency, but it is also related to
intensity. Recent studies indicate this to be true, even for complex
waves, such as the sound waves of speech. Even for pure
sinusoidal tones, fundamental frequency contributes more to pitch
perception than intensity does. It follows that if pitch variations
are important in language, an automatic speech recognizer should
measure fundamental frequency. In the following discussion we
will describe two ways in which pitch is important in language.

D.  Pitch  as  Used  in  Speech 


The meaning of a sentence in English often depends on whether
it is heard with a rising or falling pitch. To say "He's coming" with
a falling pitch is to make a statement; to say it with a rising pitch
is to ask a question. Sentences with such rising or falling pitch have
corresponding rising and falling fundamental frequencies: together
with intensity, fundamental frequency analysis can help a speech
transcriber recognize such vital differences.
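
One simple way such a distinction might be drawn from measured
fundamental frequencies is sketched below. The least-squares slope, the
tolerance value, and the function name are assumptions introduced for
illustration; the report does not specify a procedure, and intensity is
left out of this sketch.

```python
def pitch_direction(f0_contour_hz, tolerance_hz_per_step=0.5):
    """Classify a fundamental frequency contour by its overall slope.

    f0_contour_hz         -- successive fundamental frequency values
    tolerance_hz_per_step -- slopes smaller than this count as level
                             (hypothetical threshold)
    """
    n = len(f0_contour_hz)
    if n < 2:
        raise ValueError("need at least two fundamental frequency values")
    mean_x = (n - 1) / 2.0
    mean_y = sum(f0_contour_hz) / n
    # Slope of the least-squares line through (index, frequency) points.
    numerator = sum((i - mean_x) * (f - mean_y)
                    for i, f in enumerate(f0_contour_hz))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    slope = numerator / denominator
    if slope > tolerance_hz_per_step:
        return "rising"
    if slope < -tolerance_hz_per_step:
        return "falling"
    return "level"
```

Under the description above, a "falling" result for "He's coming" would
suggest a statement and a "rising" result a question.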

E. Fundamental Frequency and Accent


There is some evidence that differences in fundamental frequency
serve to distinguish between "accented" and "unaccented" syllables.
This can help our model distinguish between such words as the noun
subject and the verb subject, for instance.




Dennis Fry (1958) and Dwight Bolinger (1958) report that
frequency changes within one syllable, while all others are held to
a monotone, have the result that the syllable with frequency variation
is perceived as stressed. As it is necessary for our model to make
the same distinctions that a native speaker would, this means of
identification is particularly important.




SECTION  3:  SOUND  CHANGE  AND  THE  MULTIDIMENSIONAL  MODEL 


INTRODUCTION 

This  section  deals  with  the  relation  of  the  various  modes  of 
physical  articulation  of  speech,  discussed  in  Section  2,  to  the  processes 
of  phonetic  change  that  occur  constantly  in  languages  discussed  in 
Section  1.  We  assume  that  among  the  constant  phonetic  variations 
that  occur  within  any  language  there  exists  a  definite  order  which  may 
be  accurately  defined  by  a  properly  inclusive  conceptual  scheme.  Such 
a scheme is suggested in part by a careful examination of the phonetic
variations within English itself and in part by a broader survey of
consistent phonetic changes that have taken place in other languages.

The perils of over-reliance on the evidence of phonetic changes
in languages should of course be emphasized; in most cases our sources
are limited to written text, and deductive reasoning based on available
data must often serve in place of an actual knowledge of how sound
change occurred. Nevertheless the identification of such changes that
have taken place can serve as a guide to further orderly analysis of
sound changes that may occur in English during rapid speech. As a
possible means to such orderly analysis we suggest the concept of
well-defined planes of articulation that vary according to the physical
modes of speech production considered previously.

We will first discuss aspects of the problems which sound
change within a given language can raise for our model. We will
then evaluate evidence and conceptual approaches helpful in
determining sound changes, briefly sketch theoretical causes of sound change,
then consider how it may be possible for our model to represent the
sound changes or variations that may take place in English during
rapid speech. Identified data are included in appendices, particularly
as they relate to observed rules of phonetic change, here tabulated for
the first time with the aim of integrating phonetic, phonemic, generative
linguistic and acoustic aspects of speech.

Previously we have considered how it may be possible to identify
sounds as they are articulated individually or within relatively simple
and isolated word units. As the words or the combinations of words in
normal speech become more complex, so also do problems of
recognition. Particularly important is the identification of slurs and dropped
sounds that occur both within and between individual words.

It is common observation, for example, that such phrases as
seemed to are condensed into seemto in normal conversation; similarly
the t almost disappears from rents. Such run-on sounds occur as a
continuous wave pattern on a spectrograph, moreover, and the
identification of individual word units within this pattern requires
that our model be thoroughly familiar with possible phonetic variations
that may take place because of rapid speech or individual and regional
idiosyncrasies.

This requirement suggests three specific needs - an orderly
listing of various possible phonetic combinations in which "merging"
of sounds takes place, a comprehensive conceptual scheme for
defining boundaries between individual sounds, and a logical method
for making arbitrary divisions between word units. Procedurally our
first problem is to define clearly for ourselves the distinction between
phonetic change and phonetic variation; once this has been done, past
phonetic changes in Indo-European languages provide the best approach
to identifying various phonetic variations or changes that take place
today in English.

In the following sections we assume that phonetic variations
in modern English actually duplicate phonetic changes that have taken
place in other languages. This premise is plausible because all
Indo-European languages utilize the same physical modes of production.
Our implicit assumption is that phonetic changes are governed by an
identifiable set of rules based on physical means of production.

One possible way of deriving a table of phonetic combinations
for all languages thus might be to make an intensive study of English;
considering the vast scope of such a task, it seems more convenient
to apply to English the available evidence on the phonetic changes that
have taken place in the past. It may thus be feasible to arrange our
model with sections of reference devoted to special phonetic variations
and to places where a sound drop-out is likely to occur. Once the
various phonetic changes in English have been ordered, moreover,
it becomes easier to devise some means for arbitrarily separating the
acoustically undifferentiated words of normal conversation.

I  THE NATURE OF SOUND CHANGES

Sound change is a gradual and continual process which takes
place in all languages. It is so gradual that the speakers of the language
very seldom notice that any change is taking place. Occasionally they
notice that the speech of the oldest members of the community differs
from that of the children, but they attribute the difference to the effects
of aging rather than to changes which have taken place in the language
since the oldest inhabitants first learned it. The oldest inhabitants
themselves do not realize that their speech has changed since they were
children. They may be conscious of the "modernisms" in the speech
of  the  younger  people,  but  they  do  not  realize  that  their  own  speech  also 
contains  recently-acquired  modernisms. 

One reason people fail to notice definite sound changes from
one generation to the next is that they cannot distinguish such change
from random sound variation. As we have previously emphasized,
two different pronunciations of the same linguistic unit are seldom
identical. When the word cat is spoken twice, for example, each
phonetic unit - the k, the ae, and the t - will probably differ slightly.
In analyzing sound we face the conceptual problem of distinguishing
such random variations from changes that gradually alter the phonetic
structure of a language.

As an aid in solving this problem most phoneticians assume
that sounds can be mapped in definite space with defined boundaries for
each phone class. Such mapping is primarily a conceptual tool for
explaining our tendencies of classifying phones. Traditionally
phoneticians have conceived of sounds as units clustered around a specific
norm or center of gravity which lies in the center of the area in which
varied pronunciations of a given sound are most likely to occur (see
Figure 14). Some linguists describe sound change as simply a shift
in this center of gravity. While this explanation gives a picture of
how sounds shift, it fails to give adequate opportunity for analyzing
the physical nature of that shift, nor does it provide the necessary
frame of reference within which an orderly model for speech
recognition can operate.

As a preferable way of representing phone classes we
are suggesting a different approach - that of sound grouping within
moveable planes of articulation (see Figure 15). In our conceptual
plan sounds are not grouped by their distance from a center of gravity;
instead they are grouped within specific hyperplanes which define
their relations to each other. In the two-dimensional illustration
each plane is approximately equidistant from the central cluster of
sounds. Any sound that occurs within the boundaries of these planes
is simply a sound variation. If the direction of these planes shifts,
however, then a sound change takes place. In general, these planes
may be related to the many physical choices which a speaker makes in
shaping his words, and a shift in these planes may be equated with a
physical shift in the way a sound is produced. To indicate how this
could be done we include a diagram of the tongue's position against the
alveolar ridge as it pronounces t in the word tick (see Figure 16). Today
the position of the tongue can vary between point A and point B. We will
assume, however, that in 1900 the tongue's position could have varied
from point A' to point B', beginning at a point slightly lower than A
and never quite reaching as far as point B. In analyzing this change
according to our method of representation we would say that one of
the planes which define the sound t has shifted slightly.
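
The idea of a phone class bounded by planes can be sketched as a set of
inequalities. In the fragment below each plane is stored as a normal
vector and an offset, and a sound lies within the class when it falls on
the inner side of every plane. The two-dimensional coordinates, the
numerical positions standing in for points A, B and their earlier
counterparts, and all names are hypothetical; they illustrate the
concept only and are not measurements.

```python
def inside(point, planes):
    """True if normal . point <= offset holds for every bounding plane."""
    return all(
        sum(n_i * x_i for n_i, x_i in zip(normal, point)) <= offset
        for normal, offset in planes
    )


# Tongue position along the alveolar ridge on the first axis (assumed units).
# Present-day class for t: position between A = 2.0 and B = 5.0.
planes_today = [((-1.0, 0.0), -2.0), ((1.0, 0.0), 5.0)]

# Assumed class in 1900: from slightly lower than A to short of B.
planes_1900 = [((-1.0, 0.0), -1.5), ((1.0, 0.0), 4.5)]

sound_near_b = (4.8, 0.0)
sound_below_a = (1.7, 0.0)
```

A sound such as sound_near_b is a mere variation of t today but would
have fallen outside the class in 1900; the difference between the two
lists of planes is what the text calls a shift in a plane.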

It  should  be  emphasized  that  such  a  conceptual  scheme  requires 
data  which  can  be  organized  or  transformed  so  that  they  indicate  a 
definite  boundary  separating  information  about  one  phone  class  from 
adjoining classes. Some studies - Peterson and Barney's
"Control Methods Used in a Study of the Vowels" for example - have
reported experiments which show an overlap in the levels of formant
one and formant two that are characteristic of vowel sounds. In such
cases we must assume either that additional information not investigated
in the experiment will make separation possible or that for the vowel
sounds no hyperplane can be defined with our present knowledge.

A knowledge of past shifts in planes of articulation is important
for automatic speech recognition because they can serve as a basis for
predicting which phonetic variations will be favored and which will not.
We have many examples of s becoming h, but not the reverse. If a
speaker produces a word containing a sound halfway between s and h,
we must instruct the machine to look up a word with an s in that position,
rather than an h.
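
The instruction to the machine described above might be sketched as a
lookup in a table of one-way changes. Only the pair s to h is taken from
the text; the table structure and the function name are assumptions made
for illustration.

```python
# Attested one-way changes, stored as (earlier sound, later sound).
# Only ("s", "h") comes from the text; further entries would be added
# from the tabulated rules of phonetic change.
ATTESTED_CHANGES = {("s", "h")}


def preferred_lookup(candidate_a, candidate_b, attested=ATTESTED_CHANGES):
    """Sound to look up for an utterance heard between two candidates.

    Returns the earlier sound of an attested one-way change, or None
    when the table gives no directional evidence.
    """
    forward = (candidate_a, candidate_b) in attested
    backward = (candidate_b, candidate_a) in attested
    if forward and not backward:
        return candidate_a
    if backward and not forward:
        return candidate_b
    return None
```

For a sound halfway between s and h the function returns s regardless of
the order in which the two candidates are given.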

Most linguists today recognize two types of sound change. These
are conditioned and unconditioned. Conditioned sound change differs
from unconditioned in that conditioned change takes place under certain
circumstances while unconditioned change takes place under all
circumstances. During the nineteenth century linguists also believed in
sporadic sound change, but this theory was attacked and discarded,
because it implied that there is no pattern of sounds in language, and
that there is no limit to the number of significantly different sounds which
can exist simultaneously in a language. Current linguistic theory is
based on the assumption that every language has its own sound pattern
and that there is a limited number of phonemes or significantly different
sounds.

Of the two types of sound change which are currently recognized,
conditioned is the more common because in addition to such cases as
"Greek s became h in initial position," this category also includes
all cases in which a sound is made more similar to an adjacent one.
This special case is called assimilation, and we will discuss it in detail
below. Unconditioned change involves such cases as
"Proto-Indo-European g became Germanic k." Thus Latin genus and English
kin are cognate (derived from the same Proto-Indo-European word).
Information about non-assimilatory sound changes of this type permits
us to predict sound variations which may occur occasionally in any
individual's speech. Information about assimilatory changes will permit
us to predict the mutations which occur when specific sounds are
adjacent to each other.

Assimilatory changes seem to be the result of a strong
tendency to simplify the motions of speech articulation. This tendency
to simplify one's movements is a powerful cause of sound change. In
English the suffix for the past tense was at one time pronounced ed
in all environments. It is still pronounced that way after an alveolar
stop; tasted is an example of this. Elsewhere, the vowel was lost,
and if the verb stem ended with a voiceless sound, the d was replaced
by t. Thus the past tense of lack has a t where the past tense of lag
has a d. The d of lacked was replaced by t to save the speaker the
trouble of changing his vocal flap adjustment during the articulation
of the consonant cluster. The process of changing the first of two
consonants while leaving the second unchanged is called anticipatory
or regressive assimilation. The process of changing the second
consonant and not the first is called progressive assimilation. Most
phoneticians agree that anticipatory assimilation is the most common
type in English. When one word ends with a voiced consonant and the
next begins with a voiceless one, the final consonant may become
voiceless. Conversely if the final consonant is voiceless and the
adjacent initial consonant is voiced, the final consonant may become
voiced. Thus the phrase big pit may be pronounced with a k at the end
of big, and the phrase thick bit may be pronounced with a g at the end
of thick. The reasons for sound changes of this type are quite clear, and
we could predict many of them even if we did not have examples from
other languages.
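
The two regularities described above can be sketched as simple rules.
The small sound inventories, the letter symbols used for sounds, and the
function names below are assumptions made for illustration; a working
table would cover the full set of English consonants.

```python
ALVEOLAR_STOPS = {"t", "d"}
VOICELESS = {"p", "t", "k", "f", "s"}      # illustrative subset
VOICED = {"b", "d", "g", "v", "z"}         # illustrative subset
DEVOICE = {"b": "p", "d": "t", "g": "k", "v": "f", "z": "s"}
VOICE = {voiceless: voiced for voiced, voiceless in DEVOICE.items()}


def past_tense_suffix(stem_final_sound):
    """Modern form of the past-tense suffix after a given stem-final sound."""
    if stem_final_sound in ALVEOLAR_STOPS:
        return "ed"   # vowel retained, as in tasted
    if stem_final_sound in VOICELESS:
        return "t"    # as in lacked
    return "d"        # as in lagged


def anticipatory_variant(final_consonant, next_initial):
    """Variant of a word-final consonant that may occur before the next word."""
    if final_consonant in VOICED and next_initial in VOICELESS:
        return DEVOICE[final_consonant]   # big pit: g may become k
    if final_consonant in VOICELESS and next_initial in VOICED:
        return VOICE[final_consonant]     # thick bit: k may become g
    return final_consonant
```

Since the text says only that the final consonant "may" change, a
recognizer would treat the returned variant as an additional acceptable
form rather than as a replacement for the original sound.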

In addition to sound changes, there is a special class of linguistic
changes which are neither gradual nor regular. Some linguists call
them sound changes also, but we prefer to reserve this term for the
regular gradual process described above. The special category which
we are now discussing includes dissimilation, distant assimilation,
metathesis, and haplology. Dissimilation is the replacement of one
sound by another when the original sound occurs twice within a word.
The Latin word peregrinus (pilgrim) became pelegrinus when the
first r was replaced by l. Distant assimilation is essentially the
reverse of this process. If two sounds in a word are not similar,
one sound will sometimes be replaced by another which is more similar
to the remaining sound. The Proto-Indo-European word for five was
probably *penkwe. (The asterisk indicates that we have no written
records which show this word.) In Pre-Germanic this became *pempe,
which in turn became five. Metathesis is an exchange of position by
two sounds within a word or phrase. The Old English word for 'wasp'
was waeps. When we compare this with the modern word, we see that
the s and the p have changed places. Haplology is the dropping of a sound
or group of sounds which occurs twice within a word. The Latin word
nutrix 'feeder, nourisher' comes from an earlier *nutritrix.

All of these linguistic changes can be seen in the slips of the
tongue of contemporary speech, but at present we do not plan to
include them in our model because they are too unpredictable and
because when a speaker makes a slip of this type, he is quite likely
to notice it and correct it himself.

There are five different methods for discovering what sound
changes have taken place in a given language. They are (1) analysis
of the regular phonetic alternations of the language; (2) comparison of
the descriptions by different phoneticians, each describing the speech
of his own day; (3) examination of written records and poetry;
(4) comparison of different modern dialects of the same language;
(5) comparison of the written records of ancient languages in order to
reconstruct their parent language. Specific discussions of each of these
types of reconstruction may be found in Appendix F.

Our criteria for deciding the accuracy of a phonetic
reconstruction include the number of different reconstruction techniques which
yield this result, the phonetic probability of a given change having taken
place, and the number of times this change has been reconstructed for
other languages. If several different reconstruction techniques all
indicate that a certain change took place, this is very strong evidence
that it really did happen that way. For example, the sound change of
the English past-tense suffix is attested by three sources - phonetic
descriptions from the eighteenth century, the evidence of spelling, and
the morphophonemic alternation of modern English. We have speech
manuals from the eighteenth century, and manuscripts which spell this
suffix as ed, 't, and 'd. Finally, we have the modern
morphophonemic alternation, which is most easily explained by the assumption
that the suffix was originally ed. Taken together, this evidence leaves
no room for doubt.

Phonetic probability is the second criterion for weighing the
accuracy of a reconstruction. This criterion is frequently applied
while the linguist is working on the reconstruction rather than
afterwards. In our analysis of the past-tense suffix, we rejected the
assumption that d was the original suffix because it involved the
assumption that *tastd was once the normal form, and this is
phonetically improbable. Some linguists have objected to the traditional
reconstruction of Proto-Indo-European because it contains voiced
aspirated stops b', d', g' but no voiceless ones p', t', k'. The
articulation of voiced aspirated stops involves more glottal adjustments
than does the articulation of voiceless aspirated stops, and since there
was supposedly only one set of aspirated stops in the language, there
was no need to go to the extra trouble of making them voiced.

The  third  criterion  for  reconstructed  sound  change  is  whether 
the  sound  change  has  been  independently  reconstructed  for  other 
languages.  The  change  of  d  to  1  has  been  reconstructed  for  Latin 
and  Sanskrit;  there  is  also  alternation  in  the  Greek  dialects  between 
the names Olysseus and Odysseus. If this change should be reconstructed
for another language, we would not question it even though
at present we do not understand how this change takes place.

From our present evidence we may make two assumptions. The
first is that sound change is a gradual and regular process with orderly
characteristics of transition. A corollary of this assumption is our
belief that an orderly system to define this change is both possible
and necessary. One further aid to the development of such a system
would be a workable theory of sound change; previous attempts to
develop such a theory are discussed below.

II  THE  CAUSES  OF  SOUND  CHANGE 


Many theories have been proposed as to the causes of sound change,
but most of them have been thoroughly discredited. For the purposes of
an automatic speech recognizer, however, a valid theory of the causes
of sound change would be a valuable aid in predicting sound variations,
because it suggests both the probable direction of sound variations as they
occur in normal speech and the particular sound variations likely to take
place when two given sounds come together, as previously indicated. To
be of use in this study such a theory must meet at least three qualifications:
it must use physical means of articulation as one of its major criteria for
change; it must be sufficiently comprehensive to allow an interplay between
the various physical modes of production already discussed; and it must be
able to be stated in units comprehensible to our model.

The most commonly proposed theory is that sound change is a
simplification of the articulatory process. This is obviously true of
some cases, but obviously not true of others. The change of [inkʌm]
to [iŋkʌm] is a simplification, since it reduces the number of necessary
articulatory movements, but the change of Proto-Indo-European
t to Germanic þ does not seem to be a simplification. Moreover,
if the change of t to þ were a simplification, the change of þ back to
t in the Scandinavian languages would be the opposite. The theory that
all sound change is simplification does not fit the facts.

Other  attempts  have  been  made  to  explain  sound  change  as  the 
result of a change in environment or way of life, but it has always
been  possible  to  cite  groups  of  people  whose  languages  did  not  undergo 
similar  changes  though  they  lived  in  similar  environments  with  a 
similar  way  of  life. 

Inherent  in  all  our  work  to  date  are  the  assumptions  that  there 
is  a  related  order  in  all  language  based  on  physical  modes  of  production, 
and  that  such  an  order  may  be  graphically  represented  by  examining  the 
interaction of these physical modes. A further assumption is that sound
change is not random but proceeds along a specific pattern according
to causes inherent in the structure of the language itself. Postulating
the existence of both an orderly change in speech and an inherent order
governed by physical means of production, it appears worthwhile to
review data concerning the possible existence of an orderly series
of rules for anticipating sound change in European languages, particularly
as it relates to these physical means of production.

Historically, researchers have accomplished comparatively little
definitive work in problems of predicting sound change. Existing theories
which assume there is a single cause for change have generally been
disproven, when further research disclosed a situation in which the
special cause was present, but the expected change did not occur.

Andre Martinet, however, assumes that several factors influence
sound change: factors inherent in the physical production of language.
For this reason, and also for the purpose of obtaining a modern
Western linguistic view of sound change, we shall briefly review the
work of Martinet. Finally, although his work seems to present in an
orderly fashion many postulates similar to those on which our own
model is based, at the same time it reveals many of the limitations
of current linguistic theory when applied to sound change.

Martinet's theory is based on the phonemic theories which have
been developed by many different linguists over the past forty years.
His unique contribution is to combine these concepts into a theory of
sound change. The theory states that many of the causes of the sound
changes which take place in any particular language are inherent in the
phonemic pattern of that language, and in the distinctive features, each
of which corresponds to one or more of the physical means of production.
Thus by carefully examining the pattern, we can suggest which
changes are likely and which are not. (A phoneme, as previously discussed,
is a class of sounds which do not contrast with each other but
which contrast with members of other phonemes. A distinctive feature is
a sound quality which, alone or in combination with other distinctive
features, serves to characterize a phoneme. The differences between
Martinet's terminology and that of Jakobson are more fully considered
below.)

One  limitation  in  applying  Martinet's  theory  to  our  project  is  that 
distinctive  features  vary  from  language  to  language;  thus  each  language 
analyzed  in  Martinet's  terms  must  receive  special  attention  to  determine 
precisely  what  its  distinctive  features  may  be.  Such  a  theory  may  thus 
be  helpful  in  developing  a  set  of  postulates  that  govern  possible  sound 
shifts  within  a  single  language. 

Figure 17 represents the distinctive features of English considered
in terms of Martinet's work according to the units of our
model; there are four places of articulation that can serve to distinguish
phonemes - labial, dental-alveolar, palatal, and guttural. The initial
consonants of pin, tin, shin, and kin show these different subdivisions
of place of articulation, which forms one of the three main axes in our
model that serve to define how sounds are produced. It is also possible
to graph additional distinctive features under resonances and possibly
under manner of articulation. In English these contrasts in sound
serve to convey different meanings; thus we say that the features are
distinctive. On the other hand, the word tin would be recognizable
whether the initial consonant were articulated against the teeth, the
alveolar ridge, or the palate. In English, therefore, the two positions
of articulation for the front part of the tongue are not by themselves
distinctive features. In other languages, however, the number of such
features may be greater or less; Indian languages, for example, treat
the dental and alveolar t's as separate phonemes.

In assuming that the distinctive features in each language can
modify sound shifts, Martinet relies on the hypothesis that sound change
is likely to occur in those cases when a language already uses all the
physical means of articulation necessary to produce a particular sound
but lacks the sound itself.
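
The hypothesis just stated can be illustrated with a minimal Python sketch. It assumes a toy inventory in which each phoneme is a bundle of three feature values (place, manner, voicing); the inventory, the symbols, and the function name pattern_gaps are hypothetical and are not taken from Figure 17. The sketch lists every combination whose individual feature values are already in use but which no phoneme occupies.

```python
from itertools import product

# Hypothetical toy inventory: each phoneme is a (place, manner, voicing) bundle.
# The symbols and feature assignments are illustrative only.
INVENTORY = {
    "p": ("labial", "stop", "voiceless"),
    "b": ("labial", "stop", "voiced"),
    "t": ("dental-alveolar", "stop", "voiceless"),
    "d": ("dental-alveolar", "stop", "voiced"),
    "k": ("guttural", "stop", "voiceless"),
}

def pattern_gaps(inventory):
    """Return feature bundles whose individual values are all in use
    somewhere in the inventory but whose combination is unoccupied."""
    bundles = set(inventory.values())
    # Collect the values already exploited on each of the three dimensions.
    dimensions = [sorted({b[i] for b in bundles}) for i in range(3)]
    return [combo for combo in product(*dimensions) if combo not in bundles]

gaps = pattern_gaps(INVENTORY)
print(gaps)
```

For this toy inventory the only unoccupied bundle is the voiced guttural stop, which under the hypothesis would be a likely target for a sound shift.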

Further assuming that all speech is based on a tension between
the need for exact meaning and the desire to minimize exertion in
physical articulation, Martinet suggests that such change is more likely
to take place within the range of a distinctive group than across a boundary
between distinctive features, since a shift in sound from one distinctive
group to another could make homonyms out of two distinct words.

Figure 17 (chart). Columns, by place of articulation: Bilabial-Labiodental,
Dental-Alveolar, Palatal, Guttural. Rows, by manner of articulation: Stops,
Spirants, Nasals, Sibilants, Affricates, Laterals; each row is divided into
Voiceless and Voiced, and each of these into Unaspirated and Aspirated.

Figure 17. This chart represents the normal or most common articulation
of each American English consonant phoneme.

Thus,  in  the  chart  of  Figure  17  a  sound  shift  might  occur  between 
an  alveolar  t  and  a  palatal  t,  but  it  is  much  more  likely  to  take  place 
between  an  alveolar  t  and  a  dental  t ,  which  would  share  the  same 
distinctive  features. 

While Martinet's theories may be relevant in suggesting potential
sound shifts, there is some question whether they are comprehensive
enough to include many of the aspects of speech production necessary to
the development of a multi-purpose recognizer. Certainly the extended
scope of our model precludes complete assumption of his theories as
a basis for organization of sound changes relevant to a general purpose
recognizer.

The problems of vowel coloration and observed crossing of the
boundaries of distinctive features, in addition to the necessity of redefining
distinctive features for each language, also suggest the need for a
more general analysis of predictable sound change than presently exists.
Such analysis might subsume Martinet's theories as additional data in
instructing a general transcribing machine what sound shifts are more
likely to occur. While no such analysis for modern English exists in
terms which can be used by the multi-dimensional model which we have
developed, several factors argue that it may be created. The first is
our assumption of an inherent order in all speech directly related to
physical means of production. The second is the dominant theory of
modern linguistics that sound change is not random; and the third is the
tools of genetic, phonetic, phonemic, linguistic, and acoustic analysis
of sound.

Experiments by Fry and Denes, moreover, indicate that additional
investigation is required to correlate the work of phonemic and acoustic
analysts with the objectives of our model. Sound experiments by these
researchers revealed that their machine differentiation between the k
of cook and the t of tick was over 90%, but differentiation between the
k of kick and the t of took was less than ^v5%. This would seem to indicate
that k and t can be distinguished by distinctive features, but their
acoustic features may not retain such characteristics at all times.

The task of ordering such acoustic data into a comprehensive
theory of sound change seems feasible in terms of our model, particularly
because we must of necessity account for all the physical modes
of production which are assumed to provide the basis for directed
sound shifts. In terms of our dimensions, for example, it is definitely
indicated that the acoustic characteristics of k in kick differ from
those of k in cook because the place of articulation is invariably
influenced by succeeding vowel sounds. This reemphasizes the importance
of consonant-vowel combinations to an automatic speech
recognizer.

In  the  following  discussion,  we  first  consider  sound  change  in 
terms of our multidimensional model. We then attempt an orderly presentation
of certain predictable rules of sound change or euphonic
combination.  Some  of  these  rules  we  derived  from  our  studies  of  historic 
sound  change  in  Indo-European,  Germanic,  Old  Icelandic,  and  Celtic 
languages.  Others  were  suggested  by  the  sandhi  rules  of  Sanskrit, 
which  lend  themselves  quite  readily  to  a  systematic  analysis  of 
preferential  sound  shifts. 

Sandhi  rules  are  of  a  special  value  because  they  comprise  an 
integrated  chart  of  euphonic  combination  developed  for  a  language  that 
in  some  cases  represents  phonetic  sounds  quite  precisely.  For  example, 
although English makes no distinction between a dental and an alveolar
t, Sanskrit has phonetic symbols for both these sounds and regular
rules for the euphonic combination of each symbol with other sounds.
Other sounds not distinguished in English but specially represented in
Sanskrit include visarga vowels and aspirated consonants. Since the
Sanskrit alphabet represents an orderly grouping of possible phonetic
sounds, and these sounds in turn are based on the physical means of
articulation common to all men, it would appear helpful to apply the
rules of sandhi to our own development of an orderly method of speech
analysis.

III. SOUND CHANGES CONSIDERED IN TERMS OF THE DIMENSIONS
OF OUR MODEL

The purpose of this section is to examine how sounds may
change in contact with other sounds in rapid connected speech; a
model for speech transcription must be able to relate these changes
to the "purer" patterns of careful speech by recourse to the various
subcategories in adjoining columns, rows, or Riemann leaves.

The reader will note that the four main problems to be solved
are emphasis, regional dialects, slurring of syllables, and definition
of word boundaries. All are particularly important because there is
generally no identifiable acoustic break between words in rapid speech.
Without means of distinguishing between the sounds of separate words,
however, the construction of a general purpose transcriber becomes
almost impossible. The following discussion provides for the first
time an orderly approach to the solution of this question.

In  the  following  subsections  we  describe  some  sound  changes 
which have taken place in the past. This is not a complete list; the
compilation of a complete list would require years. A recent book
(Language  and  History  in  Early  Britain,  by  Kenneth  Jackson)  devotes 
three  hundred  pages  to  a  concise  description  of  sound  changes  of  the 
Celtic  languages  alone.  This  is  simply  a  sample  of  the  total  number 
of  sound  changes  which  have  been  described  by  linguists. 

A.  CHANGES  IN  MANNER  OF  ARTICULATION 


Changes in manner of articulation involve shifts from one
Riemann leaf to another in our charts. This type of change is quite
common. It occurs both as an assimilatory change, as when Latin
pf became ff, and as a non-assimilatory change, as when Proto-Indo-European
p became Germanic f in almost all environments. In
modern English a cluster of alveolar stop and [y] frequently becomes
an affricate. Thus did you becomes [diju], and the alveolar stop of
at you likewise becomes an affricate. (This involves a change in place
of articulation as well as a change in manner.) Most changes in manner
of articulation which have been described by American phoneticians
involve clusters with [y]. We do not know whether this is, in fact, the
most common change in manner of articulation or whether it is simply
the most conspicuous. Rules for this type of change are included as
Appendix H.II.
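
The cluster rule described above (alveolar stop plus [y] becoming an affricate) can be sketched as a rewrite over a list of phone symbols. This is a minimal Python illustration under stated assumptions: the rule table COALESCENCE, the function name coalesce, and the ASCII stand-ins "ch" and "j" for the voiceless and voiced affricates are hypothetical and are not the notation of Appendix H.II.

```python
# Assumed rule table: alveolar stop + [y] -> affricate. The ASCII symbols
# "ch" and "j" stand in for the voiceless and voiced affricates.
COALESCENCE = {("t", "y"): "ch", ("d", "y"): "j"}

def coalesce(phones):
    """Apply the cluster rule left to right over a list of phone symbols."""
    out = []
    i = 0
    while i < len(phones):
        pair = tuple(phones[i:i + 2])
        if pair in COALESCENCE:
            out.append(COALESCENCE[pair])
            i += 2  # both members of the cluster are consumed
        else:
            out.append(phones[i])
            i += 1
    return out

did_you = coalesce(["d", "i", "d", "y", "u"])  # "did you"
at_you = coalesce(["a", "t", "y", "u"])        # "at you"
print(did_you, at_you)
```

A recognizer would need the inverse mapping as well, so that an observed affricate can be related back to the stop-plus-[y] cluster of careful speech.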

B. CHANGES IN PLACE OF ARTICULATION


Changes in place of articulation involve shifts from one row
to another in our charts. Most changes in place of articulation are
assimilatory; they occur only when two consonants are adjacent, but
a few, such as the change of Old English final m to n, are
non-assimilatory. One assimilatory change is the change of [s] to [š]
when it is followed by [š]. This commonly occurs in the word
horseshoe, in which the [s] of horse is usually assimilated to the
following [š]. In Appendix H.I we include a list of changes in place of
articulation which have taken place in the past, together with examples
of English words and phrases containing the same sound combinations.

C.  RESONANCES 

Changes in resonances involve shifts from one column to another
in our charts. Changes in resonance represent the most common
assimilatory sound change. Consonants with different places and manners
of articulation occur next to each other in the words of many languages,
but it is unusual to have voiced and voiceless consonants adjacent to
each other. Almost all languages will permit sound combinations at word
boundaries which they will not permit within a word, but the fact that a
certain type of sound-combination is uncommon in the words of any language
seems to indicate that the same combination may be frequently
modified at word-boundaries also.
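
As a rough illustration of the point about word boundaries, the following Python sketch flags adjacent word pairs whose boundary consonants disagree in voicing and are therefore candidates for modification. It assumes, purely for illustration, that spelling letters can stand in for phones and that the two small sets VOICED and VOICELESS are an adequate classification; both are simplifications introduced here, not part of the model.

```python
# Assumed letter classes; spelling stands in for phones in this sketch.
VOICED = set("bdgvz")
VOICELESS = set("ptkfs")

def mixed_voicing_boundaries(words):
    """Return adjacent word pairs whose boundary consonants disagree in voicing."""
    flagged = []
    for left, right in zip(words, words[1:]):
        a, b = left[-1], right[0]
        if (a in VOICED and b in VOICELESS) or (a in VOICELESS and b in VOICED):
            flagged.append((left, right))
    return flagged

pairs = mixed_voicing_boundaries(["bad", "time", "that", "girl", "red", "dress"])
print(pairs)
```

Here "bad time" and "that girl" are flagged, while "red dress" is not, since both boundary consonants are voiced.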

There  are  conditions  other  than  assimilation  which  cause  changes 
in  resonance.  In  standard  German,  no  word  may  end  with  a  voiced 
stop,  spirant,  or  sibilant.  The  phone  becomes  voiceless  in  that  position. 
Thus bunt 'bright' and Bund 'group' are homonyms.
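
The German rule can be written as a one-step positional rewrite. The Python sketch below is illustrative only: it uses spelling letters as stand-ins for phones, and the mapping DEVOICE covers just the voiced stops, spirants, and sibilants assumed for the example.

```python
# Orthographic stand-ins for phones; the mapping covers only the voiced
# stops, spirants, and sibilants assumed for this sketch.
DEVOICE = {"b": "p", "d": "t", "g": "k", "v": "f", "z": "s"}

def devoice_final(word):
    """Devoice a word-final voiced obstruent; leave everything else unchanged."""
    if word and word[-1] in DEVOICE:
        return word[:-1] + DEVOICE[word[-1]]
    return word

print(devoice_final("bund"), devoice_final("bunt"))
```

Both inputs yield the same output form, which is the sense in which the two words are homonyms.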

Some  changes  in  resonance  are  unconditioned;  that  is,  they  take 
place under any circumstances. Thus Proto-Indo-European d became
Germanic  t.  English  two  and  Latin  duo  are  cognate  (go  back  to  the  same 
Proto-Indo-European  word). 

In Appendix H.III we list some of the changes in resonances which
have  taken  place  in  various  languages. 

D.  HOW  SOUNDS  DROP  OUT 

There  are  two  different  types  of  sound  drop-out.  One  is  the 
loss  of  a  sound  from  a  certain  position  and  the  other  is  the  loss  of  a 
certain  sound  from  any  position.  The  first  type  includes  the  dropping  of 
at  least  one  consonant  from  a  consonant  cluster  and  the  dropping  of  cer¬ 
tain  sounds  in  final  position.  The  second  type  includes  such  cases  as 
the  loss  of  Proto-Indo-European  p  in  almost  all  positions  in  the  Celtic 
languages.

The problem of how to treat these drop-outs is a complex one.
At present we are attempting to limit the number of positions in the
charts from which a sound that is not part of a cluster can drop
completely. Thus, at present we assume that ("it] does not drop out directly
but  becomes  .  which  becomes  w.  which  drops  out.  In  Appendix  H.  IV 
we  give  some  examples  of  sound  drop-outs. 

E. DURATION

In English, differences in consonant durations are not phonemic
(word-differentiating) and the speakers of the language have some
freedom to vary these durations. The most common variation is to
shorten a long consonant (or a double consonant; we use these two terms
as synonyms). Thus red dress may be pronounced with a long or a
short [d]. Long consonants which are the result of assimilation or the
dropping of an intervening consonant are also subject to shortening:
outdoors can be spoken without the [t] and with a long or a short [d];
lasts can be spoken without the [t] with a long or a short [s].

Conversely, some short consonants may be lengthened under
certain circumstances. If a person is counting slowly and rhythmically,
he may pronounce eighteen with a long [t] because the preceding words
fifteen, sixteen, and seventeen all have consonant clusters in the middle,
and if the speaker says eighteen with a short [t], he will break the rhythm.


The problem of length variations due to rhythm is a complex
one. Andre Classe has advanced the hypothesis that if there are
several strongly accented syllables in an utterance, the speaker tries
to vary his tempo so that the time interval from one accented syllable
to the next is a constant. Classe calls this equality of time intervals
isochronism (Classe, 1939). When the number of intervening syllables
is very uneven, the speaker tries to achieve isochronism
but does not succeed. If this hypothesis is correct, the duration
measurements of all unaccented syllables depend on the positions of
the accented syllable. This is a subject which requires extended
research.
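
To make the consequence of this hypothesis concrete, the following Python sketch computes idealized syllable durations under perfect isochronism. It rests on assumptions that are not part of Classe's statement: the interval between accented syllables is taken as exactly constant, the interval is divided evenly among the accented syllable and the unaccented syllables that follow it, and the utterance is assumed to begin with an accented syllable. The function name and the default interval of 0.5 second are hypothetical.

```python
def syllable_durations(stress_pattern, interval=0.5):
    """Split each stress-to-stress interval evenly among its syllables.

    stress_pattern: list of booleans, True for an accented syllable;
        assumed to begin with an accented syllable.
    interval: assumed constant time (seconds) between accented syllables.
    """
    durations = []
    starts = [i for i, accented in enumerate(stress_pattern) if accented]
    for k, start in enumerate(starts):
        end = starts[k + 1] if k + 1 < len(starts) else len(stress_pattern)
        count = end - start  # accented syllable plus the unaccented ones after it
        durations.extend([interval / count] * count)
    return durations

durations = syllable_durations([True, False, True, False, False, False])
print(durations)
```

In this idealization the syllables of the second, longer group are each half as long as those of the first, which shows how the duration of an unaccented syllable would depend on where the accents fall.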

F. INTENSITY

Intensity of different portions of continuous speech is perceived
to be different, except in monotone enunciations. This is a natural
phenomenon of control that a speaker exercises to keep his speech
from becoming boring.

Such variation in the intensity depends on several factors,
some of which are:

1) Emphasis or deemphasis of an utterance

a) modification of its meaning

b) drawing attention to a specific part of it

2) Relative combination of speech sounds that necessitate
the emphasis or deemphasis because of natural limitations
on production of speech.

The variations of the first type are often controlled by the
grammar and syntax of a language, and they do not merit consideration
in this part of the study of controlled variations caused by the combination
of speech sounds. The variations of the second type merit
discussion; but neither is orderly definition of these available, nor can
information about these be separated from that for the first type of
variations.

An additional difficulty in this field is the need for an acceptable
definition of intensity and the interdependence of intensity variation and
of variation in fundamental frequency. An orderly study of the effects
of intensity would require extensive additional study. In Section '1 we
present examples of acoustic data which show the necessity of intensity
measurements.

G. FUNDAMENTAL FREQUENCY

Fundamental frequency, like intensity, varies from one portion
of speech to another for reasons that are similar to those mentioned
in the previous subsection.

Unlike  the  measurement  of  intensity,  however,  one  can  find  a 
generally acceptable definition of fundamental frequency that can be
used  for  these  studies.  However,  the  information  on  this  subject  needs 
to  be  further  evaluated. 

H. GLOTTAL ADJUSTMENTS

The  two  glottal  positions  which  are  commonly  used  in  speech 
are  the  positions  for  voicing  and  for  voicelessness.  Besides  these, 
however,  there  are  other  glottal  positions  which  are  sometimes  used 
and which result in a different spectral pattern. Two of these adjustments
are visarga and laryngealization.

During  a  visarga  vowel  the  vocal  flaps  vibrate,  but  they  do  not 
touch  each  other  as  they  do  for  normal  voicing.  The  resulting  vowel 
has a somewhat breathy quality. In Appendix I we include a detailed
description  of  the  visarga  vowels. 

One theory of the production of a laryngealized sound is that
the vocal flaps take on an hourglass shape, and both the front and
back halves vibrate while the middle and the ends are relatively still.
The resultant sound has a slightly grating quality. Laryngealization
is often used in American speech as a substitute for a drop in fundamental
frequency at the end of an utterance.

IV. SOUND CHANGES INVOLVING "PROBLEM" PHONES

A.  CHANGES  INVOLVING  PHONES  WITH  A  SINGLE  PLACE  OF 
ARTICULATION 

The phones which commonly occur in English but which present
certain problems include [y] (you), [r] (right), [h] (how), and the glottal
stop [ʔ] (the pause between vowels of uh-oh!). The glottal stop also
commonly replaces t in certain words and phrases, such as what was;
for this reason it must be included in the model.

There are several separate reasons why these phones are
problematic. Y is a semi-vowel corresponding to [i] (the vowel of eat)
in manner of articulation save that y functions as a consonant. Although
we have classed y with the consonants, its similarity to a vowel creates
unresolved problems in describing manner of articulation. In the case
of [r] it is possible that the place of articulation of this usually retroflex
consonant can vary in American English. The sound [h] is traditionally
described as a voiceless vowel; since they receive a separate dimension,
it seems necessary to devote a separate set of Riemann leaves to [h]
plus vowel combinations. The problem with the glottal stop results
from its place of articulation; our model provides for no place of
articulation further back than the guttural, but the glottal stop is
articulated further back.

In Appendix H.IV to this section we include a list of rules affecting
these phones. The rules for [y], [h], and [r] are likely to be much
more complete than the rules for [ʔ], both because the foreign languages
from which we have derived our rules have [y], [r], and [h], but no
glottal stops; and because the occurrence of the glottal stop requires
further investigation into the phonetics of American English. Although
we doubt that the glottal stop replaces an initial t, for example, this
has not been proven. There is also the additional possibility that the
glottal stop may replace final t before guttural (that girl) as well as
before a bilabial. In certain New York accents, moreover, the glottal
stop also replaces t before l as in little.

B. CHANGES INVOLVING PHONES WITH TWO PLACES OF ARTICULATION

There are two commonly-occurring consonants in American English
which have two simultaneous places of articulation. These are [w] and
[kw], as in quit. Both of these consonants involve simultaneous
guttural and labial articulation. This combination represents
the most common type of simultaneously articulated (co-articulated)
consonant, known as labiovelars. Since labiovelars have two different
places of articulation, they can also have two manners of articulation.
For [kw] the lips are rounded and open, while the tongue makes a complete
closure at the back of the mouth. For [w], there is constriction
both at the lips and at the back of the mouth, but there is no complete
closure. The rules in Appendix H.IV deal with sound changes that have
affected labiovelars in other languages.

V. THE RELEVANCE OF SANDHI RULES OF SANSKRIT TO OUR
MODEL

In developing a general purpose transcriber the rules of sandhi
are particularly valuable because they seem to be an indication of how
sounds are produced under conditions of normal continuous speech. A
partial reason for this is that Sanskrit grammarians who formulated
sandhi rules wished to produce a set of precepts to describe a language
as it was currently being spoken. In describing this language, moreover,
the grammarians wished to produce phonetic clarity as well as grammatic
precision. As a result sandhi rules give particular attention to problems
which English grammar (as opposed to the science of phonetics) tends
to ignore. Thus, sandhi includes not only rules of euphonic and grammatic
combination within words, but also the phonetic combinations likely to
occur when two words come together; such combinations, moreover,
are usually expressed in the Sanskrit spelling as well as the rules of
grammar.

To  the  problem  of  changes  caused  by  coalescence  of  sounds 
between  words,  English  phonetics  has  given  relatively  little  attention, 
but  the  importance  of  being  able  to  identify  such  changes  with  our  model 
is  apparent.  Speakers  tend  to  pronounce  their  sentences  in  rhythmic 
phonetic  phrases  whose  boundaries  need  not  coincide  with  those  of 
the  words  involved,  a  situation  that  provides  the  basis  for  familiar 
jokes  about  children  who  return  from  church  singing  songs  they  learned 
orally about the three kings of "ory and Tar" (three kings of Orient
are...). Such verbal configurations, as previously indicated, are
the result of the natural transfer of a sound from the end of one word to
the start of a following word whenever the conditions of physical articulation
make this easier for the speaker than pronouncing the two words
distinctly and the transfer does not interfere with clarity of meaning.

According to our present evidence English phonetics does not
seem to analyze fully the problem of transcribing such transfers in an
orderly fashion. Sandhi rules, however, distinctly recognize the
possible shift caused by the coalescence of a final sound with an initial
one, and the sound change so produced is often formally defined.
Recognition of such combinations between words is an important aspect of a
general purpose transcriber, since the acoustic characteristics of
a particular letter may vary considerably depending on preceding and
following sounds. £ and l, for instance, are extreme examples of this.
The following discussion considers particularly those aspects of sandhi
that are relevant in expanding our comprehension of sound changes
likely to take place between the sounds of separate words, while it
provides a rationale for further rules of euphonic combination presented
in Appendix H. V. In using Sanskrit rules to help us predict English
sound changes, however, we must consider the nature of each rule before
deciding how to utilize it.

There are two types of sandhi rules, and while both are helpful
in our model, their applicability is different. The first type of rule is
the result of conditions which prevail in all human language, and therefore
this type is directly applicable. A typical rule of this kind is that when
a voiceless consonant and a voiced consonant come together, one is
changed so that either both are voiced or both are voiceless. This
rule stems from the fact that human beings find it easier to pronounce
two voiced or two voiceless consonants together than to change the
vocal flap adjustment in the middle of a consonant cluster.
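
The voicing rule just described can be sketched in a few lines. This is
only an illustration under stated assumptions: the consonant pairs are a
small hypothetical table, and the choice to change the first consonant
to match the second is an assumption, since the text says only that
"one is changed."

```python
# Hypothetical voiceless -> voiced pairs, for illustration only;
# this is not the report's own phone table.
VOICED_OF = {"p": "b", "t": "d", "k": "g", "s": "z", "f": "v"}
VOICELESS_OF = {voiced: voiceless for voiceless, voiced in VOICED_OF.items()}


def is_voiceless(consonant):
    return consonant in VOICED_OF


def is_voiced(consonant):
    return consonant in VOICELESS_OF


def assimilate(first, second):
    """Make two adjacent consonants agree in voicing.

    Assumption: the first consonant takes on the voicing of the second.
    Consonants outside the table are returned unchanged.
    """
    if is_voiceless(first) and is_voiced(second):
        return VOICED_OF[first], second
    if is_voiced(first) and is_voiceless(second):
        return VOICELESS_OF[first], second
    return first, second
```

Under these assumptions, `assimilate("t", "d")` yields a fully voiced
pair, while a pair that already agrees in voicing is left alone.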


The  second  type  of  sandhi  rule  is  the  product  of  historic  survival 
or  analogic  change.  Historic  survival  is  the  retention  of  a  sound  in  a 
particular  phonetic  environment  after  it  has  been  changed  or  dropped 
everywhere  else.  Analogy  is  the  extension  of  a  rule  from  one  group 
of  forms  (in  this  case  sounds)  to  a  group  of  similar  forms.  There  is  a 
sandhi  rule  to  the  effect  that  before  certain  voiceless  stops  n  is  changed 
to  anusvara  (nasalized  vowel)  and  a  sibilant  is  inserted  between  the 
anusvara and the following stop. This rule represents both an historic
survival and an analogic change. At an earlier stage of the
language, there was a very large group of words ending with ns and a
smaller group ending with n. As the result of a sound change the s was
lost from the ns words everywhere except before certain stop consonants.
This meant that in most environments the ns words were identical
with the n words, but in certain cases the ns words had an s, while the
n words did not. Gradually people forgot the difference, and since the
ns group was larger, the rule which applied to that group was extended
by analogy to the n group.

We would not expect this rule to apply to English directly because
English has not undergone the sound change necessary for the historic
survival. It does suggest, however, that it is worthwhile to see whether
the same processes might have acted on some other English sounds, and
we do have one example of that: the so-called "intrusive r" of New England
speech.

In some New England dialects final r was lost after vowels
everywhere except when the next word started with a vowel. The result was
that the pronunciation of the word deer was different in the phrases deer
walks and deer is. The pronunciation of deer before a consonant was
identical with the pronunciation of the last part of idea in all phonetic
environments. Gradually the rule about r spread from deer is to
idear is.

A careful examination of all historic and analogic sandhi rules
together with a study of the history of English sounds should probably
serve to point out more such rules which apply to English.

In Appendix H. V we have separated the rules which we believe
to be historic and analogic from those which we believe to be phonetic.

The other rules of sound change in Appendix H are all phonetic.

The first particular contribution sandhi may make for our model
is to furnish an orderly guide for the coalescence of the final consonant
of one word with an initial vowel of the following word. In Sanskrit the
words aham adityair are written according to syllables, as follows:
a ha ma di tyair. The final consonant of aham goes with the initial vowel
of adityair. The combining of a final consonant with an initial vowel is
quite common in English speech. Most people distinguish an aim from
a name only when they are being extra careful. This sound change must
be recognized in our model, since we treat consonant-vowel combinations
as a unit.
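
The transfer of a final consonant to a following initial vowel can be
sketched as below. The sketch assumes, purely for illustration, that
each letter stands for one sound and that the five letters a, e, i, o, u
are the only vowels; neither assumption comes from the report.

```python
# Assumption: one letter per sound and a simplified vowel set.
VOWELS = set("aeiou")


def resyllabify(words):
    """Move a word-final consonant onto a following vowel-initial word."""
    out = [list(word) for word in words]
    for i in range(len(out) - 1):
        current, following = out[i], out[i + 1]
        if (current and following
                and current[-1] not in VOWELS
                and following[0] in VOWELS):
            # The final consonant joins the next word's first syllable.
            following.insert(0, current.pop())
    return ["".join(word) for word in out]
```

With these assumptions, the words an aim regroup like a name, and
aham adityair regroup so that m begins the second unit, as in the
syllable-by-syllable spelling quoted above.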

A second potential contribution of sandhi rules is to delineate
possible conflicts and sound changes that can occur in the articulation
of one or more consonants which are part of a consonant cluster. The
phrase sit down is likely to become sidown in normal speech, as has
already been pointed out. On the other hand, the same drop-out of a
voiceless stop next to a voiced stop with similar place of articulation is
not likely to occur in the phrase look good, because a drop-out of the
k might destroy clarity of articulation and consequently clarity of meaning.

With further research into the effects of combining different
physical articulations, we may find it possible to utilize sandhi rules
in setting up a relatively complete set of instructions to inform our model
exactly when and how changes will take place.

A further advantage in applying sandhi rules to our model is that
they point out the existence in English of certain sounds which had
hitherto been considered peculiar to Sanskrit. One of these sounds is the
visarga, which is a vowel-like sound similar to h. In Sanskrit visarga
occurs under certain circumstances as a substitute for final s or r. In
English we have observed it in New England speech at the end of the words
car, law, and December (see P. Denes, Computer Processing of
Acoustic and Linguistic Information in Automatic Speech Recognition,
Contract No. At*' hi 17f*, March, University College London,
England, Fig. 7(a), page «i7), and before the final k of park. If the
Sanskrit rule applies to English, it covers only car and December, since
the r of park is not final and law has no r. It is possible that in English
visarga can occur with any final vowel. (In the New England dialects
car ends with a vowel, since there is no final r.) This rule would cover
car, December, and law, but not park. It may be that visarga occurs in
park only when the complete closure of the stop is held so long that the
vowel is like a final vowel.

As outlined above and more specifically tabulated in Appendix H. V,
sandhi rules present a highly useful guide in the development of an
organized set of data on which to base instructions about sound change
for our model. With an orderly representation of phonemes and an
orderly representation of the rules of sound change, it can become
possible to present rules for the speech of selected languages organized
so that the rules may be suitable for mathematical treatment or for
computer analysis.

VI. REPRESENTATION OF RULES FOR EUPHONIC COMBINATION

In  past  reports  we  have  indicated  that  in  the  course  of  normal 
speech  many  sound  changes  occur.  In  verbal  form  we  have  presented 
long  lists  of  such  changes  or  possible  changes.  If,  however,  such  data 
is to be available to a machine, a mathematical (i.e., symbolic)
representation is necessary. The rules listed in Appendix H. VI present such
a digest of the data already presented.

The form these rules take is that of "ordered," "speech-environmental"
rules which may be considered from a "phenomenological"
point of view. These terms require explanation. Ordering of rules
means that some rules have priority over others. If, for instance, there
were a rule that a given sound becomes z ordered before a rule that the
same sound becomes s, the second rule would never take effect in any of
those situations where the first rule applied.
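
Rule ordering of this kind can be sketched as a first-match-wins lookup.
The rule format below (source phone, required following phone, result)
and the placeholder symbol x are hypothetical choices made for the
illustration; they are not the notation of Appendix H. VI.

```python
def apply_ordered(rules, phones):
    """Apply ordered rewrite rules to a sequence of phones.

    rules is a list of (source, following, target) tuples in priority
    order. following is the phone required immediately after the source,
    or None if any environment will do. For each phone, only the first
    rule that matches takes effect; later rules are never consulted.
    """
    out = []
    for i, phone in enumerate(phones):
        next_phone = phones[i + 1] if i + 1 < len(phones) else None
        for source, following, target in rules:
            if phone == source and (following is None or following == next_phone):
                out.append(target)
                break
        else:
            # No rule matched: the phone is left unchanged.
            out.append(phone)
    return out


# Hypothetical rules: x becomes z before d, otherwise x becomes s.
RULES = [("x", "d", "z"), ("x", None, "s")]
```

Where the first rule applies (x before d), the second never takes
effect; reversing the order of the two rules would let the general rule
pre-empt the specific one everywhere.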

Speech-environmental rules are based on the idea that the
changes which occur in any particular sound (phone) are determined
wholly by the nature of the few surrounding phones plus the special
characteristics of the speaker. Such a view is supported by our
contention that the changes which occur during speech are determined
by the mechanical nature of the speech-producing mechanism. The
characteristics of the speaker determine where he takes care in
enunciation, and so in particular take account of those phonemic distinctions
which are made in his language.

Phenomenological rules are those which describe an outcome for
the physical articulation of speech. They are to be contrasted with
probabilistic and random rules which may specify for any particular
situation in speech that any of several outcomes may occur depending
either on fixed probabilities or on unknown outside factors.

Those rules we have indicated in Appendix H. VI, therefore, are
not those for any particular speaker, but rather those of many different
speakers with many different characteristics. As such they represent
the first recent attempt to present the euphonic combination of natural
speech in an orderly fashion that might be understood by a general
speech recognizer. Such a presentation, based only on available data,
cannot be considered definitive or complete. In some cases there is
not enough information about rules of euphonic combination previously
developed to represent them exactly, and certainly a considerable amount
of additional research is needed in this area. The rules compiled in
Appendix H. VI are thus important particularly for indicating the feasibility
of such ordering of acoustic and phonetic data for speech recognition and
for outlining potential areas of further exploration.

With the tentative development of such rules, we now
leave the linguistic aspects of our model to examine the acoustic aspects
of analyzing speech for general transcription.




SECTION  4:  ACOUSTIC  CONSIDERATIONS  OF  SPEECH 


INTRODUCTION 

Our  object  in  this  section  is  to  relate  the  acoustic  information 
about  speech  to  physical  means  of  articulation  and  the  rules  of  euphonic 
combination.  Such  correlation,  while  involving  both  a  review  of  past 
research  and  utilization  of  concepts  developed  according  to  the  dimensions 
of physical articulation, is oriented primarily toward the future
development of comprehensive theory and an ordered set of data to meet the
analytical needs of a general purpose speech recognizer. Although by
no means covering all aspects of acoustic research, therefore, we
consider those lines of acoustic investigation that may be most relevant
to the generation of further data for transcription of carefully
articulated or continuous speech, while suggesting potentially useful avenues
of further exploration.

Since the development of vocoders at the beginning of World War II,
there has been much interest in identifying speech by its characteristic
energy patterns as recorded on spectrograms; investigation in this
field has led to a considerable amount of data relevant to our model as
well as the development of several limited purpose speech recognizers
discussed below. In many aspects of speech production and perception,
however, present concepts need further clarification to fit the needs
of adequate general recognition by electronic equipment.

One particularly pressing question is the relation of past
acoustic analysis to the needs of our multi-dimensional model.
Published research in acoustics tends to concentrate on spectrographic
analysis of discrete words, some nonsense syllables, and only a few
selected sentences. Generation of comprehensive information on speech
production and perception must utilize the acoustic data obtained from
words, but it must also depend on the concepts of dimensions
and continuous physical interaction of sounds which occur in continuous
speech. It should be noted that data based on concepts of continuous
speech can have a direct relevance to the acoustics of discrete words
when such words have more than one syllable; the same transition of
voiceless stop to voiced stop occurs in sit down and cupboard.

Integration of past studies with present acoustic work can thus
define more clearly the informational needs of speech research in
several related fields, notably methods of articulation, methods of
perception, and the needs of a speech actuated machine. The results of
our integration of such aspects of speech suggest three general needs:
(1) Extension of acoustic studies using carefully articulated joined
words as well as discrete words; (2) Modification of present sets of
phone classes used for actuating machinery through speech and for
evaluating speech processing equipment; (3) Continued investigation
into physical manner of articulation.

While the first need cited above involves primarily an orderly
increase in the available data by recording speech on spectrograms and
time-amplitude plots, the second and third call for a considerable
amount of complex research that can relate aspects of present acoustic
research to the actual physical production of sound. Present analysis
suggests the probable value of reducing the amount of data necessary
for speech recognition by organizing acoustic information according to
concepts designed in our model. A further efficiency in general
transcription of speech is made possible by anticipating the effects of
combination of words or syllables through stored acoustic data.

Our research has examined the properties of phone classes such
as dental n which have undergone little acoustic analysis in English
because they do not distinguish between the meanings of words, and
phone classes such as visarga vowels which are not generally recognized
by linguists, acousticians, and phoneticians. Although non-distinctive,
such phone classes are identifiable both by a distinct set of criteria
for physical articulation and by a recognizable speech wave-form; dental
n differs from alveolar n both in place of articulation and in the second
formant transition. Since a general model for speech recognition must
use distinctions acoustically more precise than those of phonemics, the
acoustic correlates of such phone classes are important in the
automatic transcription of sound.

While acoustic information now available cannot give a
comprehensive description of English speech in terms as complete as
those described above, present information still needs to be ordered
in terms capable of accounting for the detailed acoustic information
mentioned in the above paragraph. Without such categorization,
involving a reduction in choices necessary to identify a sound, the almost
astronomical varieties of phone classes which may need to be identified
could transcend the capacities of computing equipment that now exists
or that can be projected in the near future. To account for such
problems we have already projected an orderly arrangement of phone
classes in our model; this section presents a portion of the available
acoustic data to justify such an ordering.

In clarifying our method for arranging phone classes according
to their physical means of production, we face the third need already
cited: further investigation into physical manner of articulation. The
function of the tongue in forming mouth cavities is particularly significant
in the acoustic production of distinctive patterns of energy
concentration at certain levels of frequency and certain instances of time. The
precise role of this function, however, is not yet fully defined in
relation to its modification of the speech wave patterns. The significant
contributions of MIT and the Royal Institute of Technology in this field
are not to be discounted: we suggest merely an extension of their
methods and a modification of some aspects of their work to accommodate
the new concepts which have been introduced through our model.

I.  BACKGROUND  OF  ACOUSTIC  WORK 


Although  many  of  the  concepts  in  this  field  have  been  under 
investigation  since  the  start  of  studies  of  generation,  of  propagation, 
and  of  perception  of  sound,  recent  investigations  in  acoustic  phonetics 
seem  to  be  closely  connected  with  instruments  that  are  similar  to 
Dudley's  vocoder. 

The vocoder is essentially a bank of band-pass filters that divide
the speech spectrum into about twenty bands. It was developed primarily
for secure voice communication with the use of bandwidths which are a
small fraction of those used for the conventional telephone system.
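
The idea of dividing the speech spectrum into about twenty bands can be
illustrated with a short sketch. It is not a model of Dudley's analog
circuitry: the bands are given equal width, and band energy is computed
from a plain discrete Fourier transform rather than from physical
filters. Both choices, and all names below, are assumptions made for
the illustration.

```python
import cmath
import math


def band_edges(low_hz, high_hz, n_bands=20):
    """Return n_bands + 1 edge frequencies (equal widths are an assumption)."""
    step = (high_hz - low_hz) / n_bands
    return [low_hz + i * step for i in range(n_bands + 1)]


def band_energies(samples, rate_hz, edges):
    """Sum the DFT power that falls inside each band.

    A crude digital stand-in for a bank of band-pass filters.
    """
    n = len(samples)
    energies = [0.0] * (len(edges) - 1)
    for k in range(n // 2):
        freq = k * rate_hz / n
        coeff = sum(s * cmath.exp(-2j * math.pi * k * t / n)
                    for t, s in enumerate(samples))
        for band in range(len(energies)):
            if edges[band] <= freq < edges[band + 1]:
                energies[band] += abs(coeff) ** 2
                break
    return energies
```

Feeding the function a short sampled tone and looking for the band with
the largest energy gives a rough picture of how a filter bank localizes
energy in frequency.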

(The importance of vocoders in acoustic and phonetic research is
a subject of Appendix J.) Early success with vocoders increased the
interest in understanding the basic nature of speech and of its
recognition. The first extensive work on this subject, reported in the
textbook Visible Speech by Potter, Kopp and Green, used a modification
of the vocoder system, namely a spectrogram, for presenting
spectral densities of speech waves as functions of time. The speech
waveforms studied therein were identified by the words or the
sentences spoken and by speaker identities.

Human observers were found capable of "reading" these patterns
and also of relating these to those portions of the spoken sentences
that produced their waveforms. This was found to be true even for speech
waves generated by a number of speakers selected for these tests. Such
results indicated the possibility of specifying characteristics of these
patterns as representative of certain articulatory positions of the tongue,
the lips, the mouth cavity, and of vocal flap vibrations and nasal
resonances. Considering the expected and observed variations in speech
waveforms of different speakers saying the same sentence, it is to be
expected that only the gross characteristics of these patterns were used
in studying their relationship to the above aspects of speech production
and to speech perception. Information obtained from such studies has
also influenced the work of linguists and phoneticians in their
classification of sounds of different languages. One such work is that on the
distinctive features of "phonemes."

Some  machines  that  could  be  actuated  by  speech  were  also 
designed from information about characteristic patterns of speech waves.
A brief summary of such activity is presented in Figure 18. The
possibility of designing machines that could be actuated by speech
opened a new field. (The subject is discussed further in Appendix K.)

Speech recognizers have been built which assume that a machine
is capable of recognizing words or phonemes. Most of these machines
have enjoyed limited success, as we mention in Appendix K. That is,
most of these machines work with a limited vocabulary and/or with
carefully articulated, isolated words. It has often been thought that
refining the existing methods would increase the applicability of these
machines. However, a more useful approach might be to design a
method of computer operation to account for anticipated acoustic
imprecisions of speech.

COARTICULATION AND DURATION

The present extent of acoustic data on speech production suggests
the value of an organized analysis to correlate what is available, what
is important, and what is still needed for the development of a general
purpose recognizer. The following discussion suggests such an approach;
Appendix T. relates the information obtained from such work to needs
outlined in Section 1. At present it may be noted our sources of acoustic
information are primarily spectrograms. The particular value of
spectrographic analysis is that it presents waveforms in a pictorial
representation which enhances some of the more significant acoustic
characteristics of speech. These characteristics have been related
to speech perception by work at Bell Telephone Laboratories and Haskins
Laboratories, and related to speech production by work at Bell Telephone,
Massachusetts Institute of Technology, and the Royal Institute of
Technology, Stockholm.

Potential limitations of the spectrograms, however, are suggested
by the fact that in the process of presenting information that is most
significant to the efficient reception of speech we must also decide upon
information that is to be considered redundant. Moreover, in passing
speech through a bank of filters used by the spectrograph, speech
characteristics are distorted.

Although such decisions and distortions may eliminate information
that seems not to be essential for human perception of speech, such
as intensity, duration, and rate of articulation, this information may be
particularly significant in the interpretation and identification of speech
wave-forms by a general purpose transcriber. Thus it should be kept
in mind that the data presented below represents only one type of
acoustic measurement of speech. (Another potential source of
information is provided by time-amplitude plots, which represent
graphically the frequency of the whole speech wave measured against
time.)

[Figure 18b: Recognizer Development]


On  the  basis  of  available  acoustic  data  with  the  limitations  noted 
above,  the  following  research  suggests  coarticulation  as  a  necessary 
description  of  real  speech  events  and  indicates  the  importance  of 
duration as a dimension of our model. The substantiation of
coarticulation from such data is particularly worth noting, since almost all
the speech measured was in the form of short words or carefully
articulated sentences "manufactured" to illustrate certain phonetic
sounds. Natural speech, it should be remembered, tends to run
adjacent phones together considerably more than the data discussed
below; phonetic transcription of such speech will be discussed in
this section.

A. STUDIES IN THE IMPORTANCE OF COARTICULATION

Our model is organized on the assumption that vowels and
consonants may be articulated simultaneously and at times from the same
physical position in the mouth, as illustrated by the Multidimensional
Model. Initially there were two reasons for such an ordering of speech.
The first is common observation that many consonants, particularly
labials such as p, allow the tongue to take the position of a following
vowel during consonant articulation; a subsidiary support for
coarticulation is provided by related experience with Sanskrit phonetic rules of
word combination. The purpose of research discussed in this subsection
is to provide further evidence for such coarticulation.

If consonants and vowels can be coarticulated, it should follow
that there is both physical and acoustic evidence for such coarticulation.
That is, not only would we be able to show evidence of coarticulation
through palatograms and motion picture X-rays of articulatory motions,
but such physical coarticulation should have a definite relation to the
acoustic signal of the speech-wave as measured by spectrographs,
formant vocoders, or time-amplitude plots. Formant positions of
consonants, for example, might be expected to show certain variation in
their transitions depending on the characteristic formant positions of
following vowels; time-amplitude plots might indicate a characteristic
sound wave for each vowel-consonant combination. Although complete
investigation of these phenomena is not available, a survey of present
data does suggest the presence of such evidence for coarticulation.


Present physical evidence for coarticulation is provided by the
work of H. M. Truby (Truby, 1959), who carried out acoustic and
phonetic investigation to determine whether all phones are influenced
by adjacent phones. In one such experiment which he describes in
detail Truby took motion picture X-rays of articulation of the word
plotch; these X-rays revealed that at the time the lips burst open for
p the tongue is already in position for the following l. Similar evidence is
available for other consonant clusters in which the physical articulation of
one consonant does not interfere with assuming the articulatory position
of the following consonant.

Truby's proof of coarticulation exists only for consonants and
semi-vowels, but it has long been assumed by phoneticians that
speakers tend to reduce the time and energy expended in physical
articulation. Thus in circumstances when there is coarticulation of
a consonant plus a following consonant or semi-vowel, there is also
likely to be coarticulation of a consonant plus following vowel, assuming
this is feasible under the physical conditions of articulation. Since the
position of the tongue is independent of the articulation of labials, for
example, we may expect coarticulation in the word patch as well as in
plotch. Such physical coarticulation is also likely to occur for
consonants other than labials, though probably to a lesser degree.

Evidence from many sources, moreover, suggests the
possibility of correlating physical coarticulation discussed above with specific
features of the acoustic signal as measured by spectrographs, other
spectral measuring processes, and time-amplitude plots. Although
no exact correlation linking physical coarticulation and differences in
the levels of transitionary formants presently exists, there is partial
acoustic justification for further work on this matter in the research of
Ilse Lehiste and Gordon Peterson (Lehiste and Peterson, 1900), Bjorn
Lindblom (Lindblom, 196 5), and in our own correlation of data from
Truby and Visible Speech.

The work of Ilse Lehiste and Gordon Peterson, specifically
discussed below, was undertaken to investigate whether an acoustic
distinction existed between formant movements which serve as cues for
consonant identification and formant movements which signal the
presence of a complex syllabic nucleus such as a glide or a diphthong.
Actual experiments depended primarily on the articulation of one
subject, who pronounced specified words within the "frame" sentence:
"Say the word ___ again." Spectrograms of these sentences were
then made and the researchers measured levels of the first three
formants at the following points: start of the onglide measured at the
consonant release; end of the offglide; duration of the onglide from
consonant release to steady-state of the syllable nucleus; duration
of the steady-state; formant positions at the steady-state; duration
of the offglide.

[Figure 20: Application of the Locus Theory to Three Vowels in
Natural Speech. Horizontal axis: Time in Milliseconds.]

The results of such experiments, shown on the graphs of Figures
19 and 20, indicate the various starting points of the onglide and the
steady-state, as well as the duration for different vowels after the
consonant d. Lehiste and Peterson discovered that there is a wider
frequency range of starting positions for formants of vowels following
labials than for vowels following other consonants, and also that the
average duration of onglides is shorter. This is most conveniently
explained by assuming physical coarticulation of vowels and consonants:
since the tongue is not used in articulating labials, it is freer
to assume different physical positions than while pronouncing consonants
articulated primarily by the tongue; such opportunity for coarticulation
has greater effect on formant transitions.

Lehiste and Peterson's work, it may be noted, dealt with individual
words articulated, if not discretely, at least in a sentence position
set off by pauses. In further investigations into the nature of formant
variation, experiments with continuous speech are needed. Such research
(which also tends to suggest the coarticulation of vowels and consonants)
was performed by Bjorn Lindblom at the Royal Institute of Technology,
Stockholm.

Lindblom's work was carried out to test the hypothesis that the
articulation of vowels in unstressed syllables is centralized, that is,
occurs at the mid-point of the traditional phonetic chart for vowel
articulation. He made spectrographic analyses of one subject pronouncing
consonant-vowel-consonant combinations such as leak under varying
timing conditions and with a systematically varying context. Having
tabulated his spectrograms, Lindblom was able to compare the formant
onglides of the vowels of such minimal pairs as bob and gog; his results
indicated considerable frequency variation in the steady-state of
identical vowels positioned between different consonants.

It is probable that the formant levels of Lindblom's vowels were
affected by following consonants, but his work also supports our belief
that the articulation of consonants and following vowels is interdependent.
Representative measurements taken by Lindblom (On Vowel Reduction,
p. 5b, Figure II) for the vowel a in bab and gag show that the frequency
level for the vowel steady-state of gag is approximately 550 cycles higher
than that of bab when the vowel duration is very short. Such measurements,
duplicated in Lindblom's work with other vowels, also show that the
influence of a preceding consonant on a vowel steady-state will have
an inverse relation to the duration of the vowel.

In order to ascertain whether there may be a difference in the
acoustics of given consonants when articulated with different vowels,
we made measurements of various formant levels of consonant-vowel
combinations as indicated on spectrograms from Truby, Visible Speech,
and Lehiste and Peterson. A discussion of our method for performing
these measurements may be found in Appendix M. Although a certain
factor of error in measurement must be allowed, and although the data
examined is by no means exhaustive, we noted a considerable variation
in the frequency level at the beginning of the onglide of various vowels
following the same initial consonant. For example, we examined the
frequency level at the starting point of the voiced second formant
onglides following the consonant b: for these starting points there is
a variation which ranges from 640 cycles for the onglide of O to
1925 cycles for the onglide of i. We further noted energy concentrations
at varying frequencies within the voiceless portion of s when it
occurs in combination with various vowels and semi-vowels. Such
concentrations occur well below the greater energy levels of the
unvoiced portion of s, which gather at regions above 2500 cycles; the
frequency of the lower energy concentrations can vary from
230 cycles for the s of sweet to 1750 cycles for the s of see. Such
variation suggests the influence of vowels on preceding consonants.
It also suggests the existence of a second formant onglide for
combinations of voiceless consonants plus vowels.

Tabulation of the data presented above tends to offer a body of
evidence in conflict with the locus theory (Delattre, Liberman and
Cooper, 1955), which maintains that regardless of what vowel may
follow, all formant transitions from any specific consonant begin at
one specific frequency level for each consonant, although the initial
part of this onglide cannot be measured. According to this theory,
for example, all the formant transitions following the consonant d
have a characteristic slope which, if extended backward, would
meet at one point called the locus. (Delattre et al., "Acoustic
Loci and Transitional Cues for Consonants," J. Acoust. Soc. Amer.,
p. 771, Figure 4.)
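
The locus claim can be restated as a small computation. The sketch below
is illustrative only: it assumes each measured onglide is approximated by
a straight line from its starting frequency to its steady-state frequency,
extends each line backward past the consonant release, and reports how far
apart the extended lines are at a chosen instant. The class and function
names and all sample values are hypothetical; they are not measurements
from this report.

```python
from dataclasses import dataclass


@dataclass
class Onglide:
    """Straight-line approximation of one second-formant onglide."""
    start_hz: float      # frequency at the measurable start (consonant release)
    steady_hz: float     # frequency at the vowel steady-state
    duration_ms: float   # time from release to steady-state

    def slope(self) -> float:
        """Rate of change in cycles per millisecond."""
        return (self.steady_hz - self.start_hz) / self.duration_ms

    def extended_back(self, ms_before_release: float) -> float:
        """Frequency the line would have had ms_before_release earlier."""
        return self.start_hz - self.slope() * ms_before_release


def spread_at(onglides, ms_before_release):
    """Range (max - min) of the extended lines at one instant before release."""
    values = [g.extended_back(ms_before_release) for g in onglides]
    return max(values) - min(values)


# Hypothetical onglides after one consonant, before three different vowels.
sample = [
    Onglide(start_hz=2000.0, steady_hz=2300.0, duration_ms=60.0),
    Onglide(start_hz=1700.0, steady_hz=1200.0, duration_ms=50.0),
    Onglide(start_hz=1600.0, steady_hz=900.0, duration_ms=70.0),
]

# A single locus would require the spread to fall to nearly zero at some instant.
for t in (0.0, 25.0, 50.0):
    print(f"{t:4.0f} ms before release: spread {spread_at(sample, t):7.1f} cycles")
```

If the spread never approaches zero over any plausible interval before the
release, the extended transitions do not share a single locus under the
straight-line assumption.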

Such a theory, if valid, would make it feasible to identify preceding
consonants by the slope of the onglide of the following phone;
if vowels and consonants are coarticulated, however, the frequency
level of onglides would seem to depend upon the physical articulation
of the following vowel, rather than a characteristic onglide slope
pointing to the locus. Since the resolution of such a conflict is important
to the identification of acoustic aspects of speech, we discuss it in
further detail below.

The locus theory was developed by Haskins Laboratories as part
of a method for stylizing formant patterns which could be put onto the
Pattern Playback to produce sound waves which listeners heard as specific
speech sounds. A primary purpose of these experiments was to define
those characteristics of the speech wave important in the perception
of phone classes; by actuating speech through synthesized formant
patterns Haskins was able for the first time to conceptualize and
define the importance of significant elements of the speech wave such
as start of the onglide, frequency at start of the onglide, deviation
from start of the onglide to steady-state, and location of the center
frequency of unvoiced portions of the noise burst.

Within the confines of this important research the locus theory
represents a system for ordering data about the acoustic patterns of
the speech wave in terms suitable for experimentation. As it is not
feasible to construct stylized formant patterns without imposing some
form of order on their location, the locus theory was eventually
conceived as a successful effort to organize acoustic patterns of the
sound wave from the Playback to be identified by listeners. Since
the experiments of Haskins were all with stylized speech, however,
the question remains whether the results obtained from such perceptual
studies can directly describe the characteristics of actual speech.

In applying the results of Haskins' Pattern Playback to general
speech one would need to rely on two assumptions. The first is that
the locus to which the transitionary formants lead will not be affected
by the following vowel. The second premise deals with the nature of the
stylized formant patterns which produced sounds identified by listeners
as falling within given phone groups; one might need to assume that
such stylized patterns produced through mechanical methods can be
used to give information about the acoustic characteristics of speech
as it is produced by human beings.

At present there is insufficient data to prove or disprove the
second premise. The first premise is called into question by the data
on physical coarticulation shown by Truby and also by the variations
in formant levels which were found by Lehiste and other researchers
in combinations of one initial consonant with different vowels. The
experiments of Lehiste and Peterson also suggest that the formant
transitions of given consonants will not in all cases, if extended back,
meet at one locus. Figure 19 (which we have constructed according to the
results published by Lehiste and Peterson) shows the different
transitionary onglides from the consonant d; the onglides have been
extended but they do not meet at one point. Moreover, in plotting the
formant onglides of the vowels on, a, and u according to data from Lehiste
in Figure 20, it is apparent that the onglides cross each other in that
portion of the speech wave which is actually measurable.

Further data to be considered are the experiments at Haskins
itself with Pattern Playback (Delattre, Liberman, and Cooper, 1955).
The experimenters synthesized spectrograms of speech. On these
stylized spectrograms the first formant onglide and steady-state were
kept constant, while the second formant was drawn at various levels
of frequency. In all cases transitions to various vowels were drawn,
originating from the supposed locus of d rather than simply pointing to it.

When these patterns (with complete formant transitions) were
run through the Playback, they produced sounds identified by listeners
as b, d, g, and d, depending on the relative frequency level of the formant
steady-state for the following vowel. When the initial portion of the
transition was erased, however, all sounds were heard as combinations
of d and following vowels.

Haskins explains this data by suggesting that part of the change in
the position of the articulators to go from a consonant to a vowel takes
place during a silent period before the measurable beginning of the
onglide. Thus transitionary formants only point back to a locus rather
than leading there. Such evidence, however, might also indicate that
the starting point of a formant transition after consonants depends on
the articulation of following vowels.

Our particular criterion for using coarticulation as an acoustic
division of our model was availability of physical and acoustic data
tending to indicate evidence for such coarticulation. In the review
presented above, there would seem to be sufficient data to justify
continued use of coarticulation. Having thus reviewed the acoustic and
phonetic aspects of coarticulation we will explore in the following section
the effects of a phone's duration on the acoustic characteristics of its
wave form.

0. THE ACOUSTIC EFFECT OF CHANGES IN DURATION

Available evidence suggests that duration measurements may be
particularly important in distinguishing between different vowel-consonant
combinations, compensating for individual differences in stress, and
identifying words uttered by speakers using different dialects. The need
for making such important distinctions in transcribing speech supports
our inclusion of duration as a dimension of our model.

The relation of duration to consonant-vowel combinations has been
particularly explored by Bjorn Lindblom, who recorded consonant-vowel-
consonant syllables under various conditions, including stress and
intensity. In his data the duration of a vowel and the level of the final
formant position reached in its articulation depend on the preceding and
following consonants. According to Lindblom, moreover, the relation
between the steady-state level of the second formant of a vowel and its
duration may be described by a mathematical expression derived from
curve-fitting techniques. Such relationships indicate the important
acoustic nature of duration in speech production.
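
The report does not reproduce Lindblom's expression. As a hedged
illustration of what a curve-fitted relation between vowel duration and
second-formant level can look like, the sketch below assumes a simple
exponential approach to a target frequency,
F2(t) = target + delta * exp(-beta * t), and recovers delta and beta by a
least-squares line fit to log(F2 - target). The model form, the function
names, and every number are assumptions made for illustration; they are
not Lindblom's published formula or data.

```python
import math


def fit_exponential_undershoot(durations_ms, f2_hz, target_hz):
    """Fit f2 = target + delta * exp(-beta * duration) with target known.

    Taking logs gives a straight line: log(f2 - target) = log(delta) - beta * t,
    so ordinary least squares on (t, log(f2 - target)) yields both constants.
    Every f2 value must lie above target_hz for the log to exist.
    """
    xs = list(durations_ms)
    ys = [math.log(f - target_hz) for f in f2_hz]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope   # (delta, beta)


def predict_f2(duration_ms, target_hz, delta, beta):
    """Second-formant level predicted for a vowel of the given duration."""
    return target_hz + delta * math.exp(-beta * duration_ms)


# Hypothetical values generated from the assumed model itself.
TARGET_HZ = 1100.0
durations = [80.0, 120.0, 160.0, 240.0]
levels = [predict_f2(t, TARGET_HZ, 500.0, 0.01) for t in durations]

delta, beta = fit_exponential_undershoot(durations, levels, TARGET_HZ)
print(f"delta = {delta:.1f} cycles, beta = {beta:.4f} per ms")
```

In this assumed form the contribution of the consonant context (delta)
shrinks as the vowel grows longer, which is one way to express an inverse
relation between consonant influence and vowel duration.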




Lindblom's work, moreover, indicates that the durations of onglides,
offglides, and steady-state are likewise affected by preceding and
following consonants, but the effect of such influence is also related to
the tempo, mood, and emphasis of a particular speaker. Such data
raises complex problems in constructing a general transcriber, since
it will be necessary to analyze the enunciations of individuals who will
have different rhythms of speech and different dialects.

Work by Peterson and Lehiste (previously mentioned and also
discussed in Appendix N) has provided additional information on
duration based on spectrographic analysis of consonant-vowel-consonant
combinations articulated by one speaker in a pre-selected environment.
Such research provides an approach to automatic phone classification.
Lehiste has suggested that the ratio between the steady-state and
offglide of any particular vowel will remain constant. The Peterson-
Lehiste data, however, is limited to articulation of a small number
of isolated words by one speaker in an extremely controlled environment.
Additional work is needed to apply such data to a general purpose
recognizer, whose problems in analyzing various dialects and speech
rhythms have been noted above.

Spectrograms of a standard British, General American, and
Southern pronunciation of the same utterance, shown in Section 2,
Figure 9, for example, indicate that the British speaker has
comparatively abrupt formant transitions and a long steady-state, while
the Southern speaker has long transitions and almost no steady-state.
These data seem to indicate that Lehiste and Peterson's conclusions about
General American are not necessarily applicable to other dialects.

Another factor which makes the measurement of duration important
is that in a consonant cluster a consonant may be shortened. We have
evidence, which is presented in Figure 21, that this shortening affects
l, r, and w after voiced stops, and it may affect other consonants
also. However, the duration of the vowel following the cluster is
apparently not affected.

One additional aspect of duration which again seems to suggest
its importance as a dimension of our model is the possibility of comparing
the duration of one vowel to the duration of another as one criterion for
distinguishing between them. Acoustic measurements contained in
Appendix N indicate that the duration of i is always longer than the
duration of I in the same environment.
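
A duration criterion of this kind might be applied as in the following
sketch. It assumes reference durations for i and I are already available
for each consonant environment, checks that the longer vowel really does
outlast the shorter one wherever both were measured, and then labels a new
token by the nearer reference. The environment labels, the function names,
and the millisecond values are all hypothetical placeholders; they are not
the Appendix N measurements.

```python
def always_longer(long_vowel, short_vowel):
    """True if the first vowel outlasts the second in every shared environment."""
    shared = set(long_vowel) & set(short_vowel)
    return bool(shared) and all(long_vowel[e] > short_vowel[e] for e in shared)


def nearer_vowel(measured_ms, environment, long_vowel, short_vowel):
    """Label a token "i" or "I" by the closer reference duration.

    Duration is only one criterion; ties are resolved toward "I" here
    purely for simplicity.
    """
    to_long = abs(measured_ms - long_vowel[environment])
    to_short = abs(measured_ms - short_vowel[environment])
    return "i" if to_long < to_short else "I"


# Hypothetical reference durations in milliseconds, keyed by consonant frame.
LONG_I = {"b_d": 260.0, "s_t": 190.0, "h_d": 280.0}
SHORT_I = {"b_d": 190.0, "s_t": 140.0, "h_d": 200.0}

print(always_longer(LONG_I, SHORT_I))
print(nearer_vowel(250.0, "b_d", LONG_I, SHORT_I))
print(nearer_vowel(150.0, "s_t", LONG_I, SHORT_I))
```

Because the comparison is made within one environment at a time, the
sketch respects the restriction in the text that the two vowels be
compared "in the same environment."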




III.  THE  APPLICATION  OF  AVAILABLE  ACOUSTIC  DATA  TO  THE 
NEEDS  OF  A  GENERAL  PURPOSE  RECOGNIZER 


Our interest in studying the correlation of auditory phonetics and
acoustics is to extend present knowledge about the characteristics of
sound combination, particularly in terms of data useful to a general
model for speech recognition. Such an extension involves two steps:
the orderly classification of past experimentation as it relates to such
problems as slurs, segmentation, sound drop-outs, or sound change, and
the extension or modification of such experimentation to include a more
precise definition of the varieties of phones likely to appear when two
or more sounds occur in combination.

Considerable work has already been published on the acoustic
patterns of words (generally of one or two syllables) uttered "discretely,"
one at a time. Such work helps to define the initial acoustic
characteristics of specified words and in some cases contributes to the
construction of machines for limited transcription such as the digit
recognizers.

As has been suggested above, however, the articulation of a
particular phone class can be influenced by preceding or following
phones; the words sit down may be heard as sidown when spoken as a
phrase. Such modification of sound may also cause the acoustic pattern
of discrete words to differ from those of words or phrases in which
two or more syllables occur together, as in the case of cupboard, which
exhibits the same transition of voiceless stop to voiced stop that has
been observed in sit down.

Since the construction of a general model for speech recognition
requires information about the wave-forms of phone classes as they occur
in combination, there is an apparent value in data relating to the acoustics
of carefully articulated speech and speech as it occurs in normal
conversation. We have used the data of the work of Potter, Kopp, and Green
in Visible Speech for our information about the acoustic correlates of
euphonic combination. (Published results of the investigation in
Visible Speech take the form of wide-band spectrograms representing
careful enunciation of sentences; in their work it was actually possible
to train observers to "read" sounds represented by these spectrographic
patterns whether as a whole sentence or as fragments of sentences.)

Before discussing the relevance of Visible Speech research to
our model certain aspects of its basic data should be noted. Speech
which Potter, Kopp, and Green transcribed consists of phrases either
carefully selected or specially prepared to illustrate certain phonetic
combinations. The sentence "Have half above five," for example, was
composed simply to illustrate acoustic differences between h, f, and v.
The words in Visible Speech, moreover, are articulated with considerably
more precision than in normal speech; in the sentence When did you
cut the wheat? there is a measurable pause of at least 85 milliseconds
between complete decay of the final t in cut and the onset of the word the,
although most speakers would combine the sounds either in careful
reading or normal conversation. Speech articulated with such extreme
precision may not entirely reflect the acoustics of general speech.

Such speech, nevertheless, does provide a valuable basis for the
formulation of data about sound waves and sound substitution. Even
with extremely precise articulation as in Visible Speech it is possible
to secure tentative acoustic confirmation for a number of the phonetic
substitutions which our rules of euphonic combination suggested as likely
to occur in continuous speech.

Close examination of the spectrograms from Visible Speech
provides evidence for such changes as the voicing of a normally
voiceless consonant (indicated by the presence of harmonic energy in the
very low frequency ranges) or the substitution of a stop for a spirant
(indicated by the absence of a stop gap). In several instances such actual
changes in carefully articulated continuous speech reflect phonemic
changes, based on physical means of articulation, that we have already
suggested as probable. For example, we suggest the rule that "a voiced
consonant can become voiceless if it occurs before a voiceless stop,
spirant, or sibilant;" our example of a situation in which this might
happen during continuous speech is big town. An examination of
spectrograms from Visible Speech gives the acoustic indication of such a
substitution in the sentence, This is such a big church (p. 156); in this
actual example there is no voice bar for the "s" of is, so that the voiced
consonant has become voiceless and the word is, normally pronounced "iz",
may be perceived as "iss." Other instances where available acoustic
information supports our suggested rules of sound change are discussed in
Appendix O of this report.
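
The devoicing rule quoted above lends itself to a symbolic statement.
The sketch below is one possible encoding, assuming a deliberately small
phone inventory written in plain ASCII; the symbol tables, the function
name, and the rough transcriptions of big town and is such are
hypothetical and are not the notation of this report. Because the rule
says a voiced consonant "can" become voiceless, each eligible consonant
produces an additional candidate rather than a forced rewrite.

```python
# Hypothetical, deliberately small phone inventory written in plain ASCII.
DEVOICED = {"b": "p", "d": "t", "g": "k", "v": "f", "z": "s"}
VOICELESS_STOPS = {"p", "t", "k"}
VOICELESS_SPIRANTS = {"f", "th"}
VOICELESS_SIBILANTS = {"s", "sh"}
TRIGGERS = VOICELESS_STOPS | VOICELESS_SPIRANTS | VOICELESS_SIBILANTS


def devoicing_candidates(phones):
    """Return the input plus every variant the rule permits.

    The rule is optional ("can become"), so each eligible consonant yields
    one extra candidate rather than a forced rewrite.
    """
    candidates = [list(phones)]
    for i in range(len(phones) - 1):
        if phones[i] in DEVOICED and phones[i + 1] in TRIGGERS:
            variant = list(phones)
            variant[i] = DEVOICED[phones[i]]
            candidates.append(variant)
    return candidates


# "big town" and "is such", transcribed roughly with the symbols above.
print(devoicing_candidates(["b", "i", "g", "t", "ow", "n"]))
print(devoicing_candidates(["i", "z", "s", "uh", "ch"]))
```

Keeping the unchanged sequence among the candidates lets a later stage
decide which variant was actually spoken.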

At present, it should be noted, most of our data is limited to
reduced illustrations of spectrograms from Visible Speech; such
illustrations give more information about aspects of speech production
indicated by random energy or discontinuities than about those aspects
dependent on formant transitions. Thus the present confirmation of
rules of euphonic combination is confined mainly to manner of articulation
and resonance, which produce such readily identifiable acoustic
characteristics as full closure of sound, energy bursts, frictional
energy, or characteristic placement of formants. Information about
place of articulation requires the generation of additional acoustic data,
particularly in relation to spectrographic analysis. Time-amplitude
plots, seldom analyzed in relation to continuous speech, may also yield
helpful information, possibly about place of articulation and almost
certainly about manner of articulation and resonance.

The limitations of data on the acoustics of carefully articulated
continuous speech and the relative success in confirming rules of
euphonic combination with such limited data suggest the need for further
research in this area. One additional impetus to such research is the
possibility that analysis of continuous speech may reveal or emphasize
the importance of acoustically distinct phone classes not generally
recognized by phoneticians. Research discussed in Appendix O has
already supported the existence of visarga and stressed the distinct
acoustic characteristics of dental n. Information about such phone
classes may be helpful both in the accurate transcription of continuous
speech and in the identification of regional variations likely to occur
in discrete words.

Although such a discussion can only summarize the most important
aspects of the relation between acoustics and phonetics, the value of
further experiments to extend data from discrete words is apparent.
Such research might involve investigation into the acoustic patterns of
an orderly set of coarticulated phone classes in polysyllabic words.
With sufficient information of this type at our disposal it could then
become feasible to evaluate such data as characteristic acoustic patterns
as they occur in continuous speech. Such research can enable us both
to derive further examples of euphonic combination as it occurs in
natural speech and to apply this data to the acoustic equipment of a general
purpose transcriber. Some aspects of research recommended for
immediate work in this area form the basis of the following discussion.

IV. APPLICATION OF THE RULES OF EUPHONIC COMBINATION TO
CONTINUOUS SPEECH

The final tests of rules of euphonic combination are whether these
rules describe situations that commonly occur in general speech, and
whether such rules may be effectively utilized for the analysis of speech
by a general purpose speech recognizer. Research discussed in
Appendix P is accordingly designed to test euphonic rules previously
reported by applying them to both carefully articulated and conversational
speech. The method of such application is analogous to the analytical
processing of speech by a general purpose recognizer and suggests the
comparative advantage of using euphonic rules rather than word-unit
recognizers of speech.




In our research, recordings of carefully articulated speech and
of normal conversational speech were studied; the respective sources
were "What is a Boy?" and "What is a Girl?" read on a 45 rpm record
by Jackie Gleason, and experimental tape recordings of conversation
made by Dr. J. M. Pickett in an anechoic chamber at Hanscom Air
Force Base, Bedford, Massachusetts. Phonetic transcriptions by ear
of selected portions of this speech were then made by a phonetician to
discover examples of euphonic combination previously suggested. A
summary phonetic analysis of the carefully articulated and the
conversational speech yielded examples of at least twelve different forms
of euphonic combination previously suggested. Among these examples
(cited in Appendix P) were significant confirmation for changes in
resonance, changes in place of articulation, and the joining of the
final phone in one word to the initial phone of the word following.

Such auditory analysis, it should be noted, is only a preliminary
test to suggest the value of previous research to speech recognition in
terms of our model. Additional rules may be confirmed by subjecting
the speech waves to careful acoustic instrumental analysis. By ear
alone, for example, it is difficult to identify the existence of full stop
closures, or place of articulation, as illustrated in the differences
between dental and alveolar n. Further data will rely on acoustic
measurement as well as phonetic transcription.

The Multidimensional Model for organizing speech information
represents a departure from Western methods of describing speech
and constructing speech recognizers. We reject the assumption that
it is possible to construct a phoneme recognizer or one which can
recognize words as separate units consisting of phonemes. Evidence
indicates that adjacent phones influence each other's articulation; thus
it does not seem feasible that received sounds be transmitted directly
to a dictionary of stored acoustic information. The preliminary
transcription of speech must first be subjected to criteria for sound change
and euphonic combination, criteria whose relation to the dimensions
of place and manner of articulation, intensity, duration, and resonance
has been mentioned earlier in this report. Such initial processing provides
an efficient and necessary method for solving problems such as those
caused by sound changes within polysyllabic words or in combination with
the phones of adjacent words.

Essentially the operations of such a recognizer as that described
above may be segmented into four main steps: (1) symbolic representation
of sounds received; (2) application of rules of euphonic combination to
separate words slurred together during pronunciation; (3) processing these
units through an electronic "dictionary" to identify their meaning; and
(4) written transcription of speech. Research discussed in Section 3
indicates the feasibility of representing the sounds of speech production
symbolically. The phonetic analysis of continuous and carefully
articulated speech cited above indicates that such transcription using
orderly rules of euphonic combination may be necessary for an accurate
rendition of speech as it is generally produced. Such an assertion is
reinforced by the need to analyze speech in detail for identification of
slurred sounds, by the proven ability of rules previously developed to
suggest euphonic combination occurring in randomly chosen texts, and
by the need for rectifying errors in transcription or identification through
an intermediary that can apply acoustic data to speech transcription
and provide alternatives to combinations of sounds that the "dictionary"
cannot identify.
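
The four steps can be pictured with a toy program. The sketch below is
only an illustration under strong assumptions: step 1 is taken as already
done (the input is a list of ASCII phone symbols), the "dictionary" is a
four-word lookup table, and the euphonic rules are reduced to two
hand-written restorations. The lexicon, the restoration table, and all
function names are hypothetical and do not describe the recognizer
proposed in this report.

```python
# Hypothetical lexicon: phone-symbol tuples mapped to written words.
LEXICON = {
    ("s", "i", "t"): "sit",
    ("d", "ow", "n"): "down",
    ("b", "i", "g"): "big",
    ("t", "ow", "n"): "town",
}

# Hypothetical inverse rules: a heard phone may stand for a longer sequence.
RESTORATIONS = {
    "d": [("t", "d")],   # "sit down" heard as "sidown"
    "k": [("g",)],       # devoiced g, as in "big town"
}


def restore(phones):
    """Step 2: yield the heard sequence and each single-rule restoration."""
    yield tuple(phones)
    for i, p in enumerate(phones):
        for replacement in RESTORATIONS.get(p, []):
            yield tuple(phones[:i]) + replacement + tuple(phones[i + 1:])


def segment(phones):
    """Step 3: split a phone sequence into lexicon words, or return None."""
    if not phones:
        return []
    for end in range(1, len(phones) + 1):
        word = LEXICON.get(tuple(phones[:end]))
        if word is not None:
            rest = segment(phones[end:])
            if rest is not None:
                return [word] + rest
    return None


def transcribe(phones):
    """Steps 2 to 4 applied to a symbolic phone sequence from step 1."""
    for candidate in restore(phones):
        words = segment(candidate)
        if words is not None:
            return " ".join(words)
    return None


print(transcribe(["s", "i", "d", "ow", "n"]))
print(transcribe(["b", "i", "k", "t", "ow", "n"]))
```

The point of the arrangement is the one made in the text: the lookup
fails on the slurred sequence as heard and succeeds only after a rule of
combination has supplied an alternative.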

Such a concept, it will be noted, is in conflict with theories of
speech recognition that rely on identification of word units alone. These
theories assume it may be possible to achieve an effective recognizer by
constructing a dictionary to contain the stored acoustic pattern of words
most commonly used in English, up to ten thousand words. Although
such a recognizer might be used to identify the extremely careful
articulations of a few highly trained individuals, its application to general
speech must necessarily be complicated by euphonic combinations and
individual varieties of pronunciation. These problems are discussed
below.


In research with continuous and carefully articulated speech by
linguists and phoneticians it has been a general observation that euphonic
combination and coarticulation are natural phenomena of speech; a
corollary of this observation is that unstressed words are usually
combined with adjacent words. In accordance with these criteria it may
be noted that such words as to, it's or is are incorporated into the
articulation of surrounding words in such phrases as t' learn, 's too hot,
or 's that a fact?

While it is possible to assume that such small unstressed words
can be articulated one at a time, some analysis and practice will satisfy
the reader that such articulation is unnatural and difficult even for a
trained speaker. Word-unit recognizers, although expensive to
construct, could thus operate only under special conditions and could not
be applied to many speech situations. An additional limitation to the
potential value of word-unit recognizers is suggested by the fact that
the articulation of a particular phone depends on many variable factors
based on the physical conditions of articulation. Such factors include
intensity, duration, resonance, and place and manner of articulation,
whose importance has already been emphasized in their organization
as dimensions of our model; because of their effect on speech production
it is to be expected that no two speakers will pronounce the same word in
exactly the same fashion; it is probable that individual articulations
will often be sufficiently distinct that they produce a variety of acoustic
patterns not readily related.

Substantiation for the effect of the dimensions cited above is
indicated by coarticulation, varied emphasis, and euphonic combination
discovered through phonetic transcription of conversational and
carefully articulated speech. From the careful speech of one speaker,
for example, it was possible to record six phonetically distinct
pronunciations of the word of within a one-minute interval, particularly
as it occurred in the phrase of every. Such variations depended
particularly on word placement within a sentence, stress, and the rhythm
of the speaker in voicing his ideas. Phonetic transcription also yielded
examples of euphonic combination within words, as in the loss of h in
grasshopper, the transformation of a voiced b to a voiceless p in
absolutely -- both examples drawn from our recording of careful
speech -- or the substitution of a glottal stop for a t before a labial
in the word voltmeter in the Hanscom recording of conversational speech.

From the data discussed above several observations may be
suggested. The first is the difficulty of applying techniques used in
word-unit recognizers to the transcription of general speech. Evidence
for such difficulty may be found particularly in the number of slurs
and euphonic combinations both between words and within polysyllabic
words.

A second observation relates to the problems inherent in
articulating discrete speech for transcription by a word-unit recognizer.
Even in careful speech there may be many instances of euphonic
combinations between words; and it may also be particularly difficult while
articulating speech to be transcribed by a unit recognizer to avoid the
inevitable euphonic combinations that take place within words, as in
the loss of t in softness from the Gleason recording.

Additional problems for word-unit recognizers are also indicated
in the facts that one speaker may pronounce the same combination of
words in several different ways, as in the phrase of every from Gleason,
while two speakers will almost certainly pronounce various identical
words in ways which are phonetically and acoustically distinct; phonetic
transcription indicates this to be the case with the word ears
pronounced by two different speakers in the Hanscom recordings (Passaged).


If we were to assume arbitrarily that factors discussed above caused
only eight possible modifications of stored acoustic patterns for given
words, it is apparent that word-unit recognizers would need to contain
a considerable amount of redundant data. By organizing such data
according to divisions which our model has been using, however,
we intend to improve the efficiency in speech transcription while applying
our knowledge of euphonic combination to a larger syllabus of words than
that available with unit recognizers. Experiments discussed above, it
may be noted, simply provide an initial outline of how our
multidimensional analysis is substantiated in its application to practical
problems of word recognition.
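
The storage burden implied by this assumption can be made concrete with
a little arithmetic. The sketch below, in Python, multiplies the
ten-thousand-word dictionary mentioned earlier by the eight modifications
assumed in this paragraph. The function and constant names are
illustrative assumptions, not part of the model.

```python
# Rough count of stored acoustic patterns for a word-unit recognizer.
# Both figures come from the text: a dictionary of up to ten thousand
# words, and an arbitrarily assumed eight modifications per word.
DICTIONARY_WORDS = 10_000
MODIFICATIONS_PER_WORD = 8


def stored_patterns(words: int, variants_per_word: int) -> int:
    """Number of whole-word patterns if every variant is stored separately."""
    return words * variants_per_word


total = stored_patterns(DICTIONARY_WORDS, MODIFICATIONS_PER_WORD)
print(total)  # 80000
```

Under these assumptions the recognizer would hold eighty thousand
whole-word patterns, most of which differ from one another only in
details that rules of euphonic combination could describe once.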

Several necessary lines of further investigation are evident.
Among these are complete compilation of a symbolic representation of
speech production, generation of additional rules of euphonic combination,
and generation of our own data describing the relations between acoustic
and phonetic aspects of speech. These steps will form the basis for our
continued investigation.

In the available evidence, there seems to be considerable
justification for a unified approach to speech analysis, based on the genetic,
phonetic, linguistic, and acoustic aspects of speech. In order to obtain
acoustic and phonetic substantiation for treating the articulation of
phones as complex phenomena described by an orderly set of rules
based on various physical means of production, additional data must be
examined. In this subsection we present and discuss information we
have generated concerning the acoustic correlates of phone classes.

This data, which had previously been undefined or even
unidentified, is essential to the conceptual completeness of our model for
speech recognition. With our increased knowledge of these phone classes,
we are better able to categorize them. The accuracy with which speech
segments can be identified is increased and the number of choices
required to identify a sound is reduced. Moreover, such information
suggests that minor adjustments can be made in certain rules,
adjustments which would increase efficiency -- both by making certain rules
more widely applicable and by refining other rules to apply to special
circumstances. Certainly this new empirical information describes
only a limited number of phone classes and is not yet complete enough -
in a statistical sense - to make adjustments obligatory in certain cases.
But this information nevertheless broaches areas that we would like
to study for making additional refinement. Although the small amount
of data, which is limited to demonstrating peculiarities of speech events,
constrains our conclusions, we do present a small sampling of rule
occurrence. This represents the first information we have been
able to gather on the possibility of rules for more efficient computer
use. This possibility must be investigated by obtaining additional
information about the probability of the rules being operative in
circumstances where they had previously been said to operate.

The fact that we could correlate our analysis of acoustic
modification as represented on spectrograms and time-amplitude plots
with our perception of how and when sound change actually occurred
made the generation of our own data extremely worthwhile. The need for
such correlation has been indicated a number of times when discussing
our use of source data. This work is an extension of acoustic studies
using joined words. Thus, one result of our present data analysis is
the demonstrable empirical proof it provides for the deductive
reasoning - based on source data - by which we evolved our rules of
euphonic combination and our concept of coarticulation. This enables us
to proceed with even greater assurance in the construction of our model.
Our data analysis also confirms that the concept of coarticulation is
essential in describing speech production.

Whereas our concern in euphonic combination is with such
broad problems as the elision of a sound or the general fusion of sounds,
our concern in dealing with coarticulation includes the minute influences
of one sound on another in its environment. Our descriptions of such
minute influences in the following discussion are intended to verify
coarticulation in general; but it is necessary to extend such isolated
contributions into a system of rules -- through continued research on
the characteristics of coarticulation. By generating more
object-directed data such as this, we may be able to encompass details with
rules, ultimately reducing the number of rules - as we have done with
euphonic combination - to the point where a computer can store and
apply them. We foresee a time when we will be able to increase the
efficiency of speech transcription by anticipating all coarticulation
through stored rules of coarticulation. The development of such rules,
however, would be a study in itself.

In addition, coarticulation as an approach to segmentation is of
great importance: our division of will you into the coarticulatory
segments [wi][llyou] is an example of segmentation by coarticulated
entities, a new approach to segmentation. So the principle of
coarticulation will be useful to a machine not only in the reception but in the
analysis and segmentation of sounds in the speech flow.

As a result of our analysis of the characteristics of various
speech segments -- and the precise representation which would make them
suitable for computer programming -- we are able to incorporate the
phone classes r, l, and [illegible] into our model. At the start of our work,
we were not certain of how these segments should be positioned. Thus
in Section 1 we deferred a discussion of these "problem segments." Our
work with precise representations has helped us in defining their positions
in the model.

Finally we will discuss methods by which our rules of euphonic
combination might be employed in a working computer system. (Of
course our rules will also be used in programming for the computer.)
Once the best method, with accompanying protection, verification,
and efficiency techniques, has been decided, the novelty and usefulness
of such a program -- even outside the scope of developing a general
purpose speech recognizer -- cannot be minimized. However, the
methods must be selected with care. At present we are certain
only of alternative possible methods, each of which has its deficiencies
and its advantages. We present them in summary form in this report,
while we continue to work out the more detailed problems each presents.

To  develop  a  working  computer  system  is  beyond  the  scope  of  the 
present  study,  but  it  is  a  subject  that  merits  investigation. 

DISCUSSION OF SPEECH DATA

When we use data from other people's literature, a good amount
of time is required in tracking down, culling and reapplying this data
to fit our particular needs. Of course, we are occasionally confronted
with instances in which the data we desire is either not adequate or
not available. Even when we are able to gather sufficient data from other
research efforts, we are for the sake of accuracy forced to ascertain
the precision of measurements given. And in a number of instances,
the published work in measurement is incomplete for our purposes.
For example, no one, to our knowledge, publishes information derived
from the use of time-amplitude plots; although we have found them
very valuable, their usefulness appears to have been overlooked or
ignored by others. Furthermore, other people's data provides no
information about the effects of intensity on the coarticulation of phone
classes - for no one reports intensity in a form we can use. Such
information is vital to us since we wish to know what part intensity
has in distinguishing phone classes; we have already stated that duration
may differentiate sounds (such as the vowels of bomb and balm in
certain American dialects) but we must know the effect of intensity
on duration before we can develop a recognition plan as sensitive as the
one our aims require. Finally, few people have recognized
coarticulation; the only available studies of coarticulation deal with extremely
isolated cases; so we are forced to execute our own tests in order to
study acoustical data for the speech phenomenon central to our work.
In fact, whenever possible, data generated for any experiment was
subjected to analysis for the study of coarticulation.

B.  DATA  ON  "PROBLEM"  PHONE  CLASSES 

In Section 1 we deferred our discussion of certain problem segments.
There we mentioned the greatly variant h and the elusive complex
semivowels y, w, ŋ, m, n, l. Work on our computer program has
led us to include h and r with the semivowels; among other reasons
for this, h and r, like the semivowels, can only occur before or after
vowel sounds (barring isolated exceptions such as h w in where,
phonetically transcribed hwere). When writing our first report, we
were not certain about the classification of the semivowels as a whole:
should they be treated as consonants? or as a separate manner of
articulation? Now it seems likely that we will treat them as a part of
the vowel cluster in which they occur. For example, the word crush
would be segmented cru sh. Of course we are faced with many
vexing questions beyond the matter of workable classification. For
instance there is the problem of the doubtful existence of semivowels
in particular environments. We cannot assume they exist because of
our orthographic tradition. Consider characteristics of "y". There
is little acoustical doubt that absolute initial y exists. But in some
environmental circumstances - particularly when it is by or between
[i] sounds - there is no immediate clearcut evidence of its true presence.
If it does exist, how is it represented acoustically? Are there sound
waveform manifestations that machines can identify? Another
problem is that the nature or the occurrence of a semivowel may vary
with speaker or dialect. Can we develop rules to predict these variations?
and to account for them? First we must have a more definite idea of
the acoustic and articulatory properties of each semivowel. This is
an immense task beyond the scope of our present study. In our initial
report we felt that we would be able to learn more about h, r, ŋ than
we actually were able to learn. On the other hand, we felt then that
we actually would not be able to treat vocalic or "syllabic" nasals at all -
and further on in this section we present data on the vocalic nasals. Below
we discuss our work on the w and y phone classes.
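
The segmentation of crush into cru sh described in this paragraph can be
sketched as a simple grouping procedure. The Python below is only an
illustration under stated assumptions: the class labels, the
(symbol, kind) input format, and the function name are hypothetical, and
the report does not specify how its computer program represents phones.

```python
# Hypothetical class labels; the report does not define a data format.
CONSONANT = "consonant"
SEMIVOWEL = "semivowel"  # assumed here to cover y, w, r, l and the like
VOWEL = "vowel"


def segment_vowel_clusters(phones):
    """Group phones so that semivowels stay attached to their vowel.

    `phones` is a list of (symbol, kind) pairs. A segment is closed when
    a true consonant arrives after the segment already holds a vowel.
    """
    segments = []
    current = []
    seen_vowel = False
    for symbol, kind in phones:
        if kind == CONSONANT and seen_vowel:
            segments.append("".join(current))
            current = []
            seen_vowel = False
        current.append(symbol)
        if kind == VOWEL:
            seen_vowel = True
    if current:
        segments.append("".join(current))
    return segments


crush = [("c", CONSONANT), ("r", SEMIVOWEL), ("u", VOWEL), ("sh", CONSONANT)]
print(segment_vowel_clusters(crush))  # ['cru', 'sh']
```

The sketch keeps a semivowel with whichever cluster is open when it
arrives, so a semivowel standing between two vowels would be attached to
the earlier one; deciding such cases properly is one of the open
questions raised above.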

In our work with the w and y phone classes, we focused our
attention on those environments in which a w or a y might or might
not exist. Evidence of their existence in such environments would be
good evidence of the basic acoustic characteristics of these phone classes.
And the latter is our ultimate concern.


1.  The  w  Phone  Class 

For this test, a phrase containing w was compared with a phrase
not containing w but otherwise identical. The particular phrases chosen
were "no ax" and "no wax." These phrases were particularly well suited
for our study because the frequency level of F2 (second formant) at the
end of the vowel of "no" is very close to the F2 level of w. In fact, the
similarity between this vowel and w is not only acoustical, but also
articulatory; for both require rounding of the lips for speech production.
Therefore, if present here, w is forced to distinguish itself, to make
itself known.

The selected phrases were incorporated in the sentences "we
have no ax" and "we have no wax." The sentence containing "no ax"
was included in a list of sentences which five informants read at the
beginning of the recording session. After that, the same informants
read a list of three or more sentences, including the "no wax" one.
There was an interim of at least twenty minutes between the reading of
"no ax" and "no wax." This was to deter the possibility that the informants
exaggerate differences between the phrases. (See Figures 21-26 for w
and Figures 27-30 for y. See also Table 1 for a detailed description of
the duration changes for "no ax - no wax" for each speaker and Table 2
for a close measurement of "no ax - no wax" frequency changes.
Comparable for the y study are Tables 3 and 4.)

After making tape recordings of these readings in studio, we made
spectrograms of the sentences that concerned us on a Kay Sonagraph.
Spectrograms were made for the speech of all five speakers, but those
made for Speakers 3 and 4 were not measured; these informants were
women with high-pitched voices and we found it difficult to make formant
measurements accurate enough to be at all conclusive.

Differences between these two phrases were similar in enunciation
by each of the three speakers. In all cases "no wax" has a steady-state
in which the second formant is at a very low frequency and has very weak
intensity (see Figures 21, 23, & 26). The duration of the steady-state
ranges - with different speakers - from 66 to 10[6?] milliseconds. Speaker
2's pronunciation of "no ax" has a second formant steady-state of similar
frequency and intensity; the duration of this steady-state is 32 milliseconds.
Speakers 1 and 5 pronounced "no ax" with a steady-state in which the
second formant is at a slightly higher frequency and has much greater
intensity (see Figures 22, 24, & 26). The duration of this steady-state
ranges from 23 to 26 milliseconds. For both Speaker 1 and Speaker 5
the vowel onglide that follows the "no ax" steady-state is interrupted by
a pause (a period during which all formants are greatly reduced in
intensity). This pause lasts 65 milliseconds for one speaker and 70
milliseconds for the other.

So the differences between "no wax" and "no ax" are summarized as
follows: for Speaker 2 there is a significant duration difference between
the two steady-states; for Speakers 1 and 5 the apparent differences are
in the duration, the frequency level, and the intensity of the second formant
steady-state as well as the pause in the following onglide. All those
differences identify w. We still need to refine this identification by gathering
more information - particularly about differences in transition (See
Appendix Q).
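
The duration difference summarized above suggests one very simple
machine test. The Python sketch below is an assumption-laden
illustration, not a rule from our model: the 50-millisecond threshold is
merely a value lying between the longest "no ax" steady-state (32
milliseconds) and the shortest "no wax" steady-state (66 milliseconds)
reported for three speakers, and the function name is hypothetical.

```python
# Assumed cut-off between the two groups of durations reported above:
# "no ax" steady-states lasted 23 to 32 ms, "no wax" ones 66 ms or more.
W_DURATION_THRESHOLD_MS = 50.0  # illustrative value, not a measured one


def suggests_w(f2_steady_state_ms: float,
               threshold_ms: float = W_DURATION_THRESHOLD_MS) -> bool:
    """True when the low-F2 steady-state is long enough to suggest a w."""
    return f2_steady_state_ms >= threshold_ms


for label, duration in [("no ax, Speaker 2", 32.0), ("no wax, shortest", 66.0)]:
    print(label, suggests_w(duration))
```

A test on duration alone ignores the frequency level, the intensity, and
the pause in the onglide, all of which also separated the two phrases
for Speakers 1 and 5; a usable rule would presumably combine them.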

2. The y Phone Class

The method employed for investigation of y was the same used in
the study of w. In fact, the two experiments were done with the same
informants at the same recording session. In our study of y we used the
phrases "three ears" and "three years" as incorporated in the sentences
"No animal has three ears" and "It lasted three years." It was not until
some time after the experiment that Speaker 2's spectrograms were
found to be imperfect. And our analysis of the speech of Speakers 1 and 5
(the only two remaining in the y solution attempt) put us no closer to an
understanding of the distinctions that verify y's presence than we were at
the outset. Examination of the spectrograms for these two speakers
revealed no consistent differences between the phrase with y and the phrase
without it. On the other hand, the spectrograms of the two phrases show a
major similarity. For both speakers there is only one vowel steady-state
for the entire phrase, although for Speaker 1 this steady-state is interrupted
by a pause in "three ears." (This steady-state is not entirely level; it
shows a slight rise in all cases.) See Figure 27 (but see Appendix [illegible]).

What if a difference between the two phrases does not exist? What
if y disappears in this environment leaving us with homonyms? How do
we teach the machine to solve the homonym problem? It would have to
ignore information that may be pertinent elsewhere. Frankly, we would
prefer to eliminate homonyms -- to prove that in every case there are
significant identifiable differences, but we may have to recognize that
in some instances it will be impossible to eliminate such homonyms.

3. Syllabic Nasals

We have a spectrogram and a time-amplitude plot of a syllabic n
in the word kitten spoken by Speaker 1 (See Figures 31 and 32). The
spectrogram shows that this n has a strong F1 at 300 CPS and another fairly
strong formant (probably F[?]) at around 5000 CPS. All the formants in
between are extremely weak. Other n's by the same speaker have more
energy between 300 and 5000 CPS (see Figures 21, 22, 27). The time-amplitude
plot of the syllabic n shows a waveform which is very different from that
of an ordinary n spoken by any of our informants for this research (see
Figure 33 for an example by comparison). We have a time-amplitude plot
from earlier work which shows an n with similar syllabic waveform in
the phrase "moon" (See Figure 33). This n is followed by an n with an
ordinary waveform.

Incidentally the reader will notice the brief vowel-like portion
immediately following the syllabic nasal on our spectrogram. This portion
may be the result of releasing the oral closure before the velum is closed.

4.  A  New  Vocalic  Portion 

The matter of this unclassified speech-sound is so problematical
that a linguist transcribing speech generally overlooks or ignores it.
But a speech transcribing machine could not ignore it unless instructed.
Not only does this vocalic portion show up on a spectrogram, but it has
very distinctive characteristics on a time-amplitude plot (see Figures
31 & 32). Furthermore, it appears frequently, so in building a speech
recognizer we must plan our data to allow for its occurrences.

A tentative explanation of the existence of this speech sound is the
sudden closure of the velum while sounding of a nasal is not complete.
Take the word kitten for example. There is little change in tongue position
for the t and the n, both being alveolar. At the end of the t sound's
friction the velum opens all the way; and this results in the nasal resonance
we identify as n. (Nasality occurs when more air flows through the nasal
passage than flows through the mouth cavity.) But the velum closes before
it was to close for the end of the nasal; there is an accidental flow of air
through the mouth cavity creating a new phone class. Such a phone class
may have spectral characteristics that are similar to those of a nasalized
vowel articulated in a similar manner, as discussed by Fant. This
situation occurs when the mouth cavity impedance is comparable to that of
the nasal cavity coupled to the oral passage by a limited opening of the
velum.


Since the vocalic portion lasts only 3-4 pitch periods, it is
difficult to define, with reliability, its spectral and formant characteristics
(see Figures 31 & 32) or other aspects of its waveform. Yet it is difficult
to instruct a machine to ignore this short-duration vowel. Three or four
pitch periods is the duration of the (I) vowel in the word animal (see Figures
27 & 29). If the machine ignores the vowel-like portion, it will ignore
the vowel (I) in the word animal. We can construct a valid workable rule
only if we give the machine more information about the vocalic portion
than its duration alone: information that will show how this nonsense segment
is different from cognitive segments. Tentatively, our rule in this instance
would state that after a nasal any vowel-like segment of five pitch periods
or less must be ignored unless the segment following that one is a nasal
(animal); or unless it is sandwiched between two voiceless plosives (like
pit); or between a voiceless plosive and a nasal (pin); or vice versa (nip).
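
The tentative rule just stated can be written down as a small decision
function. The Python sketch below is one possible reading of it, offered
under stated assumptions: the segment labels, the `Segment` record, and
the function name are hypothetical, and the pit and pin cases are
handled implicitly because in them the short vowel does not follow a
nasal, so the rule never applies.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical segment labels; the report does not fix a representation.
NASAL = "nasal"
VOICELESS_PLOSIVE = "voiceless_plosive"
VOWEL_LIKE = "vowel_like"

MAX_PITCH_PERIODS = 5  # "five pitch periods or less", from the rule above


@dataclass
class Segment:
    kind: str
    pitch_periods: int = 0  # only meaningful for vowel-like segments


def ignore_vocalic_portion(prev: Optional[Segment],
                           seg: Segment,
                           nxt: Optional[Segment]) -> bool:
    """Return True when `seg` should be discarded as a nonsense segment."""
    if seg.kind != VOWEL_LIKE or seg.pitch_periods > MAX_PITCH_PERIODS:
        return False
    # The rule only concerns short vowel-like segments after a nasal
    # (this already excludes pit and pin, where a plosive comes first).
    if prev is None or prev.kind != NASAL:
        return False
    # Exceptions: a following nasal (animal) or voiceless plosive (nip).
    if nxt is not None and nxt.kind in (NASAL, VOICELESS_PLOSIVE):
        return False
    return True


# kitten: syllabic n, then a 3-period vocalic portion, then silence.
print(ignore_vocalic_portion(Segment(NASAL), Segment(VOWEL_LIKE, 3), None))
```

With these assumptions the vocalic portion after the n of kitten is
discarded, while the equally short (I) of animal is kept because a nasal
follows it.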

The occurrence of phenomena like this vocalic portion helps to
emphasize the problems of people at work on phoneme recognizers. They
can teach their equipment to ignore such phenomena -- but it is necessary
first to understand and to classify the phenomena beforehand.

5.  Classification  of  Problem  Segments 

After having examined the acoustic data on certain problem segments,
we are better equipped to undertake their classification in the
multidimensional model. In Section 1 we listed h, r, y, w, and vocalic m, n,
ŋ and l, as consonants which were difficult to fit into the model (we are
calling these consonants simply because they occur in those parts of words
more often occupied by consonants than by vowels; they were not
classified as consonants on any acoustic basis).

The problem of h was relatively simple to solve: because the
consonants in the model are grouped with their following vowels, the
fact that the place of articulation of h (and consequently the quality of h)
changes with every vowel is no longer problematical. We thus classified
h as a voiceless vowel - or the voiceless portion of whatever vowel follows
it. Examples from Visible Speech (Chapter 9, Unit I, pp. [illegible]) show
that the frequency of h is the same as the frequency of the steady-state of
the following vowel - except that, in most cases, the h is unvoiced.
However, between two vowels (as in the sentence "Will you help us?"), the
h may be voiced. This voiced h is a special category of speech with
measurable characteristics and must receive special treatment.

The International Phonetic Alphabet mentions only three vowel-r
combinations, all of which are closely related: the ɝ sound in church,
the ɜ sound in bird as pronounced by a Southern American or an
Englishman, and the ɚ in better and similar unstressed syllables.


We believe, however, that for the identification of r (as well
as certain vowel sounds) it is important to notice that the vowel
sounds in such words as art, glare, fear (or true, tray, trouble)
can be diphthongized with r, so that a machine may not easily
distinguish where one sound ends and the other begins. The transition
from the steady-state of the vowel to the steady-state of the r (or
vice versa) seems an important clue to the recognition both of the
vowel and of r. We therefore treat r as a portion of the vowel (i.e.,
a vowel cluster). However, an additional Riemann leaf should be
included in the model to indicate the retroflex manner of articulation.

In the acoustic representations of the sentences "No animal
has three ears," and "It lasted three years," presented above, we
tried to determine whether y was a semi-vowel pronounced as a
diphthong with the following vowel. The time-amplitude plots and
spectrograms, as we have seen, showed that no y can be conclusively
distinguished: "Years" and "ears" seem to be acoustic homonyms
in this context.

In the sentences "We have no ax" and "We have no wax" a slight
break was noticeable in the second formant of the "no...ax" segment
of the first sentence; no such pause was usually present in the
"no...wax" portion of the second sentence. Thus it seems that w will require
special rules for resolution (and possibly will require the use of
probability). However, it seems that it can be recognized.

We also advise treating l as a vowel cluster; again a separate
Riemann leaf has been included in the model to specify the lateral
manner of articulation. Evidence of frequency and duration
measurements to be mentioned later in this section substantiates this treatment.

It must be pointed out, however, that in careful speech one
can reliably segment an l, such as Gunnar Fant has done ("Studies
of Minimal Sound Units"), because the duration and the frequency
mark it a separate entity in that case. But in continuous speech
l often lasts no longer than three or four pitch periods and shows no
appreciable change in frequency. From a genetic point of view,
one could explain this by noting that in continuous speech, because
of the lateral manner of articulation for l, the tongue does not bend
enough to yield a significant difference in the acoustic representation.
In continuous speech, then, l should be treated as a semi-vowel; in
careful speech, it can justifiably be treated as a consonant.

Finally, vocalic m, n and ng will also be treated as vowel clusters,
as described in the case of r. We have not yet worked out the particulars
of vowel recognition, as such an endeavor lies beyond the scope of the
present study. However, we have several general suggestions concerning
vowels.

(6)  Recognition  of  Vowels 

Recent measurements of the frequency characteristics of vowels
have indicated that it is difficult to distinguish acoustic areas which
correspond to the traditional phonetic vowel classes. The variation
in frequency, which results in "overlapping" of closely related vowel
classes, seems to be the result of changes in the environment, the rate
of articulation, the intensity, and the duration of certain speech sounds.

For example, our measurements of spectrograms in Visible Speech
showed that the I sound as in bit ranges from 1517 to 2041 cycles per
second in the F2 steady-state, as the consonant environment changes.

This evidence does not, however, contradict the vowel recognition
program developed by Forgie and Forgie, since their program specifies
a limited context and a fixed environment, which would stabilize the
frequency of the F2 steady-state for a given vowel.

In normal speech, however, vowel sounds occur in many
environments, with various degrees of stress, which alters the rate of
articulation, the intensity, and the duration characteristics. For this
reason, it seems advisable to allow for an F2 variation; this can be
done by "broadening" the range for the vowel classes (and hence
reducing the number of vowel classes the machine recognizes). But
although the "overlapping" of classes would be greatly reduced, a
contextual program would need to be formulated to select the correct
vowel. Such a program is being developed for consonants; by means
of this program unallowable consonant combinations will be eliminated.
To develop a contextual program for vowels, however, is beyond the
scope of this study.
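The effect of overlapping F2 ranges can be pictured with a minimal sketch. Only the 1517 to 2041 cycles-per-second range for the vowel of bit is taken from the measurements above; the other two ranges, the class labels, and the function name `candidate_vowels` are assumptions made purely for illustration.

```python
# Hypothetical F2 ranges in cycles per second. Only the range for the
# vowel of "bit" (1517-2041) comes from the text; the others are invented.
F2_RANGES = {
    "I (bit)": (1517.0, 2041.0),
    "i (beet)": (1900.0, 2500.0),
    "e (bet)": (1400.0, 1900.0),
}


def candidate_vowels(f2_cps, ranges=F2_RANGES):
    """Return every vowel class whose F2 range contains the measurement."""
    return sorted(
        name for name, (low, high) in ranges.items() if low <= f2_cps <= high
    )


# A measurement inside an overlap yields more than one candidate, which is
# why a contextual program would still be needed to pick the vowel.
print(candidate_vowels(2000.0))
print(candidate_vowels(1450.0))
```

A measurement that falls where two ranges overlap returns two classes, so the F2 value alone does not settle the vowel.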

C. VERIFICATION OF COARTICULATION AND EUPHONIC COMBINATION


(1)  "Will" 


In one test of coarticulation we studied the word "will" repeated
by the same speaker, but in different environments: (1) as an item
isolated on a word list; (2) as the initial word in the isolated sentence,
"Will you help us?" and (3) as a word in the middle of a sentence in
a continuous passage. (The specific context was, "We hope, therefore,
a judicious reader will give himself some pains to observe...")

Table ii shows duration and F2 frequency measurements for
the word "will" uttered in these three different contexts (Figures
34, 35, 36, 37, 38, 39, 40, 41, and 42). These are measurements for three
different speakers. Comparing the acoustic data representing variance in
the pronunciation of "will" we note generally that the frequency of the
vowel steady-state is highest for the single word, somewhat lower for the
sentence, and considerably lower for continuous speech. According to
Lindblom, we should expect formant levels to be influenced by duration;
and in fact in most cases we can correlate the lowering of formant levels
with the decreased duration environmental exigency has imposed. However,
although both the vowel offglide and the l are considerably shortened in the
sentence or in the continuous speech as compared to the isolated word,
frequency levels rise in both cases. In the case of the sentence environment
the rise is particularly acute. And at the same time, the reader will observe
that we found it impossible to segment between the l and y of will you in
the sentence spoken by Speaker 1 (see Figure 35). So we conclude that the
high level of l was the effect of coarticulated y. The words will you must
be segmented wi and llyou, verifying the coarticulation concept. It is the
coarticulated lly that affects the vowel offglide of i.

In the cases of the other two informants, it is possible for a human being
to perform a very intricate segmentation of l and y. But this would be
extremely difficult for a machine to do with reliability. Therefore it is always
preferable to segment wi llyou. In both of these cases (Speakers 2 and 3)
the F2 frequency level of l before y is much higher than the level of l in
the other contexts. In the first section of this report (see also Appendix B)
our chart of the laterals showed four phone classes of l with four different
places of articulation. In the rules listed in Appendix H, there is a rule to
the effect that before y an alveolar lateral becomes a palatal lateral. A
palatal lateral, like all other palatal sounds, has a high F2. The high F2
of the palatal laterals of will you (in those instances where l-y
segmentation is possible) has been predicted by our rules of euphonic
combination. In fact, the coarticulation of the l-y of Speaker 1 fits the
description of the palatal l (see Figure 38).

Before proceeding to our next example of coarticulation we wish to
make two observations. First: we have already mentioned that we intend
to include the semi-vowels y and w with the vowel clusters. So the
segmentation wi llyou fits our model. Second: it is worth noting that this
data negates Ilse Lehiste's hypothesis that there is a constant ratio for the
onglide, steady-state, and offglide of the vowel. For all three speakers
the offglide is longer than the steady-state in the case of the single
word, and shorter in the case of the continuous passage.

(2) l-g

The phrase from which this sound sequence is taken was "will give
himself". The spectrogram (see Figures 36, 39 and 42) reveals that there
is a closure following the l. Therefore, there is no coarticulation of l
before a stop.

(3) m-s

From the same phrase. The spectrogram (see Figures 36, 39,
and 41) shows that these sounds are separated; there is no evidence of
their influencing each other.

(4) l-f

From the same phrase. Sometimes the l disappears in this
environment. In this case (see Figures 36 and •l<i) we have a back l
before the f. The coarticulation (the influence of f upon l) can
probably be expected when l is followed by any labial.

(5) o-o

The next phrase analyzed was "to observe". Here the sound
sequence o-o (see Figures 36 and d3) becomes a diphthong. We have
not constructed rules for the coarticulation of vowel sounds;
determination of such rules is beyond the scope of the present study.

(6) b-s

From the same phrase. b and s are almost coarticulated.
b's release is weak and short but the moment of release is certainly
perceptible (see Figures 36 and <M). s does influence b's frequency
level; for here b's energy is near that of the s, and of course b
usually has its release energy at lower frequencies.

(7)  t-l 


The phrase studied in this case was "it lasted." t-l is an
example of coarticulation in the sense that t is significantly
modified by l (see Figures ajul Ml). First: the friction noise for t
continues beyond the first voicing pulse of l. Second: t's
aspiration is almost absent. Of course the loss of aspiration is frequent
when we deal with final t. But this is not a final t: we have shown
that its friction noise runs on into l; this t should not be treated
as an instance of final t aspiration loss.

(8) s-t

From the same phrase. The rule that reads "t following s loses
aspiration" does not truly apply here. Aspiration is caused by the
natural stressing of the phrase as a whole: la is emphasized and
sted is not (see Figures iO, H. aiul 'J-1). This results in some lack
of definiteness about the s; there is an occasional aspiration
of the t, as here. This observation points out the importance of
intensity measurements.

(9) l-h

The sentence considered next was "No animal has three ears,"
and the effect of h on l was studied. But they are definitely distinct
(see Figures 27 and 29). h is by necessity initial in English. If l
became attached to the next phone class in this case, then, h
would probably be lost. Since this would be detrimental to
comprehension, it never happens, to our knowledge.

(10)  s-th 


From the same phrase. In the word "has" s is usually a z.
Here however the spectrogram (see Figures 27 and 29) shows
voicing cessation -- the s being realized primarily as a voiceless s.
This verifies one of our early rules of euphonic combination, which
states that before a voiceless sound a voiced sound may become
voiceless.
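A minimal sketch of how such a rule might be held in a program follows. The phone symbols, the two lookup tables, and the function names are assumptions for illustration, not the report's own rule notation. Because the rule says a voiced sound "may" become voiceless, the sketch keeps both the original and the devoiced form as acceptable.

```python
# Hypothetical phone inventory; "th" stands for the voiceless th of "three".
VOICED_TO_VOICELESS = {"z": "s", "b": "p", "d": "t", "g": "k", "v": "f"}
VOICELESS = {"p", "t", "k", "f", "s", "th", "sh", "ch", "h"}


def apply_devoicing(phones):
    """Devoice each voiced phone that stands directly before a voiceless one.

    The rule is applied once, left to right, and does not cascade.
    """
    out = list(phones)
    for i in range(len(out) - 1):
        if out[i] in VOICED_TO_VOICELESS and out[i + 1] in VOICELESS:
            out[i] = VOICED_TO_VOICELESS[out[i]]
    return out


def acceptable_forms(phones):
    """The rule is optional, so both forms are kept as candidates."""
    return {tuple(phones), tuple(apply_devoicing(phones))}


# "has three": the z of "has" stands before the voiceless th of "three".
print(apply_devoicing(["h", "ae", "z", "th", "r", "i"]))
```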


(11)  k-th 

Finally we concentrated on the sentence, "He took the small
kitten home with him." Here the k-th from "took the" is coarticulated.
k is very weak here and continues into the th sound: it is extremely
difficult if not impossible to segment between the k and the th, either
on the spectrogram or on the time-amplitude plots (see Figures 31 and
‘IS).

(12) s-m

From the same sentence. There is a segment of about forty
milliseconds before m in which the noise energy of s is extremely
decreased or attenuated. This may be caused by the opening of the
velum and the consequent side-tracking of principal air flow from
the mouth cavity to the nasal cavity. The vocal flap oscillation does
not begin until the end of this forty millisecond period. This period
might in fact be the phone class linguists call voiceless nasal. But
it is very difficult for a machine to classify and to use such a
segment of speech. Evidently such a voiceless portion is normally
present when s is followed by a nasal, though the duration of
this portion varies from 20 to 40 milliseconds. The attenuated energy
level for the friction of s does not in all cases reach the low level it
reaches with the sm of small: this is the influence of m (see Figure 46).




V.  FURTHER  MEASUREMENTS  WHICH  INDICATE  THE  IMPORTANCE 
OF  DURATION  AND  INTENSITY,  AND  WHICH  SUBSTANTIATE  OUR 
APPROACH. 

From time to time we have mentioned that duration, fundamental
frequency, and intensity are dimensions of speech which merit detailed
analysis of certain minute portions of the speech waveform for effective
speech recognition. We have performed such analysis on some of
our data (of which Figures 21-46 represent only a portion); from this
analysis we obtained a sizeable amount of evidence to substantiate our
approach and to indicate the necessity of further study of the dimension
of intensity. Some of our results are summarized below.

The importance of intensity measurements is shown by the
vowel-like "nonsense" segment of one pitch period duration which
follows the m in some in the phrase "of some ancient sages" (see
Figure 58). This segment can be ignored by a computer working
with the rule mentioned earlier in this section, which specifies
the minimal number of pitch periods which are allowed in a legitimate
segment.
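As a rough sketch, such a minimum-length rule amounts to a simple filter over measured segments. The threshold value, the segment labels, and the pitch-period counts below are assumptions for illustration; the numeric minimum is not restated at this point in the report.

```python
# Assumed threshold: the actual minimum is set elsewhere in the report.
MIN_PITCH_PERIODS = 2


def drop_short_segments(segments, min_periods=MIN_PITCH_PERIODS):
    """Keep only segments lasting at least `min_periods` pitch periods.

    `segments` is a list of (label, pitch_period_count) pairs.
    """
    return [(label, n) for label, n in segments if n >= min_periods]


# Invented measurements for "some": the one-period vowel-like stretch
# after the m is discarded as not being a legitimate segment.
some = [("s", 9), ("uh", 7), ("m", 6), ("nonsense", 1)]
print(drop_short_segments(some))
```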

The necessity of redefining stops is indicated by the spectrograms
of different (Figures 47, 48, and 49). In most speakers' pronunciation,
there is no stop gap before the t. The definition of stops could thus
be modified by specifying that the stop gap may be absent when the
t follows a nasal. (This would also apply in a word such as mumps.)

In the words operation and observation the waveform for
most speakers shows either no vowel segment or a vowel segment
of very short duration between the sh and the n in the tion portion.
A rule to this effect should be incorporated into the model.

In the observ portion of observe and observation, moreover
(see Figures 36, 39, 42, 50, 51, 52, 53, 54, 55, and 56), the effect
of stress or intensity is apparent. First, measurements of these two
words, spoken by the same speaker, have shown an 11 to 18 ratio in the
overall rate of articulation of the same phone classes (observ) in
observation and observe; second, there is a 20% variation (i.e., about
200 cps) in the second formant frequency; third, there is a variation
in the duration of the individual phone classes (especially er and v)
which may be as great as 640% or as little as 7.15%. You will notice,
furthermore, that our data includes not only spectrograms, but also
time-amplitude plots, which have a dynamic range of about 45 dB.

As indicated by the rules of euphonic combination, the t of
ancient becomes aspirant in "ancient sage", because the following
word begins with s.




In the waveform of "soundness or rottenness" in continuous
speech (see Figures 57, 58, and 59) it is difficult to tell whether one
r or two were spoken; there is a single r sound indicated, which has
an abnormally long duration. Boundaries must be included in the model
to specify according to duration whether one r or two are present.
Furthermore, a computer program such as that outlined in the following
section must be included to provide a contextual means (according to the
"correct" or dictionary representation of words) for restoring word
boundaries.
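A duration boundary of this kind reduces to a threshold test, sketched below. The 120 millisecond boundary and the function name are assumptions chosen only to make the example concrete; the report does not give the value.

```python
def count_r_sounds(duration_ms, single_max_ms=120.0):
    """Return 1 if the r-like stretch is short enough to be one r, else 2.

    `single_max_ms` is a hypothetical boundary, not a measured value.
    """
    return 1 if duration_ms <= single_max_ms else 2


# An abnormally long r, as in "or rottenness", is read as two r sounds;
# the dictionary match would then restore the word boundary between them.
print(count_r_sounds(80.0), count_r_sounds(190.0))
```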

The treatment of h as a voiceless vowel or a portion
of the vowel segment, mentioned earlier in this section, is substantiated
by the spectrograms of "the human mind" (see Figures 62, 63, 64). The
e portion of the (Figure 63) has an F2 frequency of 1529 cycles and the
i portion following the h in human has an F2 frequency of 1405 cycles.
There is thus no significant formant change; the h between is merely a
voiceless or weakly voiced portion at approximately the same frequency.
Furthermore, in "Mrs. Slipslop" (Figures 65, 66, 67) the variation of
the l phone class justifies treating this as a vowel cluster.

Finally, the spectrograms of "which wise sayings" introduce
several interesting details. We have two representations of which
pronounced by Speaker 1 (Figures 68 and 69). In Figure 68, the
duration of the onglide following wh is 69 milliseconds; in Figure 69, the
duration of the same portion is 40.7 milliseconds, although the overall
duration for the word which is approximately the same in both cases
(about 255 ms in Figure 68 and 250.8 ms in Figure 69). Furthermore, in
Figure 69, no real steady-state is ever achieved for I, whereas in
Figure 68 (with the slower onglide) there is a slight steady-state. The
onglide portion thus seems to need corrections and/or normalization;
the machine must be instructed, for example, that the more rapid onglide
(Figure 69) must be extrapolated, in order to assign the proper frequency,
because the steady-state frequency in this case (Figure 69) is approximately
100 cycles per second less than in the other instance (Figure 68).

The ch in which also merits attention. Perhaps, as some researchers
have suggested, one could sample the ch pattern at some arbitrary point,
such as 6 milliseconds after the burst. However, the use of such a
technique needs justification before it can be used without question.
Short-time statistics of the ch waveform might be necessary, because
of its irregular nature, but certainly overall normalization of which
would disproportionately compress the ch portion of Speaker 2's
pronunciation, where the duration of ch is only 62.7 milliseconds (Figure
70) as compared to 125.4 and 103.5 milliseconds in Speaker 1's
articulations (Figures 68 and 69).




Which conforms also to our segmentation principle as outlined
in Section 5 of this report. Whi is one consonant-vowel portion;
chwa(i) is another.

These are not exhaustive examples, but only a few significant
details which justify our classification and which, as we shall see in
the following section, also support our approach to segmentation.




SECTION 5: SEGMENTATION AND CONSIDERATIONS FOR COMPUTER
OPERATIONS


INTRODUCTION 

Our examination of the linguistic, phonetic, genetive, and acoustic
aspects of speech has substantiated our original concept of the
multidimensional model as an orderly basis for representing speech
information, and has somewhat modified our original representation.

Furthermore, our research has enabled us to develop a method
of segmentation which is suitable for automatic speech recognition.
This segmentation principle is explained in Part I of this section.
Moreover, as we mention in that discussion, we have tentatively applied
this technique to spectrograms of words from Visible Speech, Truby,
and also to spectrograms and time-amplitude plots of discrete and
continuous speech which were generated during this project, in order
to ascertain whether our approach to segmentation seems warranted
by the evidence. Such evidence seemed necessary to substantiate
the assumptions (linguistic and genetive) which were made not only for
the model, but also for the rules of euphonic combination and coarticulation.
Our measurements have verified both our approach and our method of
segmentation on an acoustic level. Thus we believe we have selected
those segments of speech which best describe the realities of speech
events, yet which will be most meaningful to an automatic speech
recognizer. In so doing, we have successfully integrated the data
available from the various sources -- genetive, linguistic, phonetic,
acoustic, etc. -- into an orderly representation which could serve as the
basis for a general purpose recognizer.

In Part II of this section, we outline several possible approaches
to the computer program which is to resolve the perceived sounds -- i.e.,
to perform the dictionary match. Part II completes the study by
providing an outline of the functions of the various phases of the recognizer
and a discussion of how our contributions may be used in each phase to
make the recognition program operative.

I.  SEGMENTATION 

Throughout this project, we have emphasized that the place and
manner of various consonants can change, according to the vowel which
precedes or follows them; thus the genetive, linguistic, and acoustic
representation of a consonant may change. As research progressed, moreover,
it became apparent that certain sounds which had traditionally been described
as separate entities were, as H. M. Truby emphasizes, acoustically
interdependent, or coarticulated. This evidence of coarticulation, and also
the evidence of transitions provided by Haskins Laboratories in their
attempt to produce synthetic speech, have greatly influenced our concept of a
meaningful machine segment. On the basis of the evidence we have
examined, we recommend that, generally speaking, the most meaningful
unit for machine recognition will be a "consonant-transition-vowel"
segment, including any offglide of the sound which precedes the consonant
and thus helps to identify it, and including the onglide of the vowel to
the point where a steady-state is achieved.

Our method of segmentation can be illustrated by comparing
it with the segments proposed by Gunnar Fant and Bjorn Lindblom in
their "Studies of Minimal Speech Units." In Figure I-1 of that
article, the authors have marked 18 segments in a spectrographic record
of the words "Santa Claus." Segments 9-15 of their analysis would be
treated as one segment in our model. This segment would include the
k (which first shows up in the offglide pattern of the vowel), the l
(which is coarticulated with the k), and the onglide and steady-state
portions of the vowel sound of Claus. From the middle of the steady-state
to the end of the z forms another segment -- a vowel-consonant combination.

Such a segment provides a meaningful unit for machine recognition,
since it depends upon the rate of transition from the consonant to
the vowel (and vice versa) rather than the absolute formant frequency
values, which may change according to their environment and context.
Segments are "matched" by correlating the smallest articulated acoustic
representations of these segments. In order to match them more perfectly
we can either (1) equalize them, by changing the duration of portions of a
spectrogram, without altering their spectral density characteristics, or
(2) derive short-time statistics to compare two portions of
time-amplitude plots.

Our method assumes that the vowel and consonant components of
speech are interdependent and should not be separated in recognition. To
justify this assumption, we have performed considerable measurements
of words in Visible Speech. If our technique applies to these words,
it should apply to most samples of English speech. We are emphasizing
the interdependence of sounds in continuous speech, and the words in
Visible Speech are discretely articulated speech. We found in Visible
Speech considerable evidence to substantiate our principle of segmentation.
Furthermore, we performed measurements on our own samples of
continuous speech, and found that the segmentation principle was
equally applicable.

Although our study is not directly concerned with vowels, we have
found that the rate of change found in the vowel onglide provides a valuable
indication of what consonant preceded that vowel. For instance, the onglide
of i in fee has a duration of 8E milliseconds, whereas the duration of
the onglide of the i in key is 38 milliseconds (Visible Speech, pages
lEl and 51). Furthermore, in week, the duration of the i onglide is
57 milliseconds, whereas in it is 304 milliseconds (Visible Speech,
pages 207 and 113). To account for this, we can set limits for the rate
of transition, so that beyond those limits, the sound must belong in
another category. That is, if the rate of change were below a certain
slope, the sound would be matched with one segment; but if the rate of
change were above that slope (i.e., more rapid), the sound would be
classified in another category.
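A slope limit of this kind can be sketched as follows. The onglide durations of 38 and 304 milliseconds are taken from the measurements quoted above; the 400 cycles-per-second frequency excursion, the limit of 5 cycles per second per millisecond, and the function names are assumptions for illustration only.

```python
def transition_rate(start_cps, steady_cps, onglide_ms):
    """Average slope of the onglide in cycles per second per millisecond."""
    return abs(steady_cps - start_cps) / onglide_ms


def classify_by_rate(rate, limit):
    """Assign the transition to one of two categories by its slope."""
    return "rapid" if rate > limit else "slow"


LIMIT = 5.0  # hypothetical boundary between the two categories

# The same assumed 400 cps excursion over the two quoted onglide durations.
short_onglide = transition_rate(1700.0, 2100.0, 38.0)
long_onglide = transition_rate(1700.0, 2100.0, 304.0)
print(classify_by_rate(short_onglide, LIMIT), classify_by_rate(long_onglide, LIMIT))
```

With several limits instead of one, the same test would sort transitions into more than two categories.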

This rate of change is responsible not only for the duration of the
vowel onglide, but also for the steady-state of the vowel frequency. This,
too, seems to vary according to the consonant which precedes the vowel; in
leave, the steady-state frequency of the i is 2099 cycles per second; in
reed it is 1808 cycles per second. This change in frequency according
to the consonant will obviously influence the slope of the consonant-vowel
transition. For this reason, it does not seem feasible to normalize the
vowel steady-state, and still expect accurate recognition. Instead, we have
specified a segment which can readily accommodate the wide variations in
duration and frequency characteristics which our measurements have
found.

It may seem that our coarticulated sound cluster is the rough
equivalent of what is commonly called a syllable; such an analogy is not
inherent in our thinking. Instead, we have developed our principle of
segmentation from linguistic, genetive, phonetic, and acoustic aspects,
and have sought continuously to specify the smallest recognizable (and
therefore constant) articulated unit of sound. Our basic machine acoustic
unit should not be considered the equivalent of a syllable.

A main source of our segmentation principle was the information
derived from our linguistic study -- particularly the grammar of Sanskrit.
For in that grammar, an individual phone class is specified to represent
each consonant-vowel combination; these classes of sounds have been
tested to determine their applicability to the English language. It was found
that acoustically, English speech can also be divided into classes of "C-V"
combinations, although the classes are not the same in each language.

Consonant Clusters

The English language is not merely a sequence of CV and VC
combinations; two other important groups of sounds occur -- vowel
and consonant clusters. While this study does not attempt to deal
with the special problems presented by vowels, we do have certain
recommendations about the treatment of consonant clusters.

For the purposes of the perceiver (for which we have developed
this segmentation principle) most of what are commonly regarded as
clusters will be treated as separate entities. A consonant cluster such
as str seems to be different, acoustically, from s + t + r. By way of
evidence to justify this position, it has been found that the t in treat may
possibly be aspirated whereas it is highly unlikely that the t in street
will be aspirated. Similarly, the r in trade may have a shorter duration and a
higher frequency than the r in raid. Thus str or st or tr is not the
mere sum of its components, but a special class of sounds which requires
certain movements of the articulators, and which thus produces a distinctive
acoustic pattern.

B. Refinement of the Concept of Coarticulation.

Our measurements of the material in Visible Speech yielded
evidence to substantiate the concept of coarticulation. There was no
measurable voiced onglide between the p and the following vowel in person
(p. 180), between p and e in pep up (p. 8^), and between p and the following
vowel in pipe (p. 85). Also, in pep (p. 84), pass (p. 139), and pup (p. 101),
there is a close correspondence between the frequency at the start of the
voiced onglide and the (F2) frequency of the vowel steady-state.

The evidence presented by H. M. Truby yields even more examples
of coarticulation than he points out. (Acta Radiologica, Supplementum 182,
Stockholm, 1959.)

(1) For instance, the spectrogram of the word jaunt (p. 19) shows no
stop between the n and the t -- as we had also noticed in the word
different in our own data.

(2) Furthermore, as we mentioned in Section 4 of this report, the r
in words such as cheer (p. 14) and Georgia (p. 19) will be treated
not as a separate class of sound, but as a vowel cluster or portion
influencing the offglide characteristics of the accompanying vowel.

(3) In jounce (p. <10) Truby has used two phonetic symbols to represent
the vowels a and u. We would instead make this combination a
special class of vowel.

(4) He has also represented the ce portion of jounce phonetically as
ts. The sound here, we think, is more than a sequence of t and s:
again, it seems instead to be a unique class of sounds.




(5) In a pl combination, such as in plink (p. 20), we might have to specify
that the p can be unaspirated. In blink (p. 20) it is possible that the
b is modified by the l, so it might be advisable to include bli as a
distinct category; furthermore, the "typical" energy distribution of k
may be altered in kl combinations, such as clip (p. 25). Perhaps
even gl (as in glib, p. 28) must be treated as a separate acoustic
element.

(6) Again, y, r, l, and w seem often to be coarticulated with the adjoining
vowel; thus words such as tweak (p. 47) are only one CVC utterance --
the w becoming part of the following vowel.

(7) Finally, in the kl combination in the word sclaff (p. 51), there is a
strong possibility that the k will not be aspirated when the following
vowel is emphasized. This is much like the case of pl in plink
mentioned above (Truby, p. 20). This might also be true of the p
in spree (p. 52) and in other cases where the vowel following the p
is emphasized.

These examples of coarticulation occur within individual words;
it is also highly possible that coarticulation may occur at word boundaries,
as in the t-l of "it lasted" mentioned in Section 4 of this report.

Coarticulation seems an important concept in describing the
realities of speech events. The illustrations used above are not a formal
organization of all possible incidences of coarticulation, but they point out
certain "problem" segments or combinations which must receive special
attention. Our principle of segmentation is designed to deal with just such
problem segments as these, by using CV units, and also by treating such
sounds as y, r, l, and w as vowels or members of "vowel clusters."

II INFORMATION ON THE OCCURRENCE OF RULES

Using our [illegible in source], we studied the possibility that
the rules of euphonic combination were indeed operative in certain segments
of speech samples of some of the subjects but that these very same segments
could indicate that these same rules were not operative in the speech samples
of the rest of the subjects whose speech was analyzed for this study. We
include it in Table b because, although inadequate in any conclusive sense, it
provides some initial information about occurrence frequency that might
influence our ordering of our euphonic combination rules in the computer.

We eventually plan to order our rules so that those rules that apply most often
come first. We are grouping our rules by related situations of sequence and
ordering them by groups. Statistical information on the probability of
operation of one or more of these rules could possibly improve the efficiency
of our program.
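
The ordering by frequency described above could be sketched as follows.
This is only an illustration under stated assumptions, not the program
described in this report: the group names, the rule identifiers (R1 to R5)
and the log of rule applications are all hypothetical.

```python
from collections import Counter

def order_groups_by_frequency(groups, observations):
    """Return group names sorted so the most often applied group comes first."""
    counts = Counter()
    for rule_id in observations:
        for name, rule_ids in groups.items():
            if rule_id in rule_ids:
                counts[name] += 1
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(groups, key=lambda name: (-counts[name], name))

# Hypothetical rule groups and a hypothetical log of observed rule applications.
GROUPS = {
    "stop+semivowel": {"R1", "R2"},
    "stop+lateral": {"R3"},
    "nasal+stop": {"R4", "R5"},
}
OBSERVATIONS = ["R3", "R1", "R3", "R4", "R3", "R2"]

ordered = order_groups_by_frequency(GROUPS, OBSERVATIONS)
print(ordered)
```

A program that tries the groups in this order would test the most
frequently applicable rules first, which is the efficiency gain the
paragraph above anticipates.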




III OUTLINE OF APPROACHES TO THE COMPUTER PROGRAMS


The sequence of speech sounds in the construction and transmission
of words and utterances is due to the physical limitations of speech-producing
mechanisms and to the demands of linguistic tradition. We have developed
rules of euphonic combination, based on an understanding of preferred
positions for sound, to determine how sounds are modified by speech
environment. In the conceptualization of a multi-dimensional model for
speech recognition, we integrated data on the genetic, phonetic, phonemic,
and acoustical aspects of speech -- in a manner faithful to the realities
of speech events. The rules we derived from this and other information
represent the first time an orderly approach to the modifications of
adjoining phone classes has been clearly defined. And the rules are
practicable. We have reduced them to symbolic representation and
prepared them for use in a computer program. We have approximately
five hundred rules; but we were able to group these to reduce the number
of rules the computer must store. This grouping was made possible
by the nature of the structural ordering of phone classes. Phone classes
are related by the dynamics of articulation; p, t, and k are related, as
are g, d, and b. So, that which applies descriptively to the combination
of k and g applies also to t and d or p and b. Therefore, a computer need
only store about fifty rules for defining the effects of adjoining forms
on each other.
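
The saving obtained by grouping can be illustrated with a short sketch.
It assumes, hypothetically, that a stored rule is written once over phone
classes and expanded on demand into rules over individual phones. The
class names and the sample rule below are invented for the illustration
and are not taken from the rule set of this report.

```python
# Related stop pairs: (voiceless stop, voiced stop) sharing a place of articulation.
STOP_PAIRS = [("k", "g"), ("t", "d"), ("p", "b")]

def expand_rule(template):
    """Expand one class-level rule into one concrete rule per stop pair."""
    concrete = []
    for voiceless, voiced in STOP_PAIRS:
        substitution = {"VOICELESS_STOP": voiceless, "VOICED_STOP": voiced}
        concrete.append(tuple(substitution[slot] for slot in template))
    return concrete

# Hypothetical stored rule, read as (left phone, right phone, resulting phone):
# a voiceless stop followed by its voiced counterpart reduces to the voiced stop.
TEMPLATE = ("VOICELESS_STOP", "VOICED_STOP", "VOICED_STOP")

rules = expand_rule(TEMPLATE)
print(rules)
```

One stored template yields three concrete rules here, which is the kind
of reduction that takes roughly five hundred rules down to about fifty.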

We can choose from a number of methods in designing the system
by which our machine actually computes what euphonic reductions it must
account for. At present, three such methods are under consideration.
In each, the application of our rules is fundamental.

The reversed rule method involves the application of all applicable
reversed rules to any given situation. Predetermining possible
consonant reductions will, to an extent, mitigate the formidable problem
of such an approach (the proliferation of possible rule applications).
Constant reference to a list of allowable consonant clusters after each
rule application is still admittedly inefficient. So of our three methods,
the reversed rule method is the one we are least likely to employ.
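
A minimal sketch of the reversed rule method, under stated assumptions,
may make the inefficiency concrete. Phones are written here as plain
letter strings, and both the reversed rules and the list of allowable
clusters are hypothetical; only the t-to-ch change before y is drawn from
an example used in this report.

```python
def reverse_candidates(observed, reversed_rules, allowed_clusters):
    """Apply every applicable reversed rule once; keep only allowable results."""
    candidates = set()
    for reduced, antecedents in reversed_rules.items():
        start = observed.find(reduced)
        while start != -1:
            for antecedent in antecedents:
                restored = (observed[:start] + antecedent
                            + observed[start + len(reduced):])
                # Each restored form must be checked against the allowed list.
                if restored in allowed_clusters:
                    candidates.add(restored)
            start = observed.find(reduced, start + 1)
    return sorted(candidates)

# Hypothetical reversed rules: reduced form -> possible unreduced forms.
REVERSED_RULES = {"chy": ["ty"], "ch": ["ch", "t"]}
# Hypothetical list of allowable clusters.
ALLOWED = {"ty", "chy", "ky"}

candidates = reverse_candidates("chy", REVERSED_RULES, ALLOWED)
print(candidates)
```

Even with two rules, several restorations are generated and each one is
looked up in the allowed list, which is the proliferation noted above.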

The consonant cluster method involves the construction of a
dictionary containing all reduced forms of consonant clusters and all
possible antecedents of those clusters. By first finding all the correct
antecedents of each initial cluster, we establish the environment for any
terminal cluster we consider. Of course an understanding of antecedents
(unreduced consonant clusters) requires an understanding of how consonant
clusters are reduced in speech. So it is impossible to construct lists
without our rules of euphonic combination. Once such lists are established




for the determination of terminal-initial clusters we may apply them to
medial clusters, breaking medial clusters into terminal-initial clusters
and then solving. But at present the problem of medial cluster segmentation,
among other problems, makes it more likely that we will use an alternative
solution (by our rules).
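
The difficulty of medial cluster segmentation can be seen in a small
sketch. The lists of terminal and initial clusters below are hypothetical
and phones are written as single letters; the point is only that one
medial cluster may admit several terminal-initial divisions.

```python
def split_medial(cluster, terminal_clusters, initial_clusters):
    """List every division of a medial cluster into terminal + initial parts."""
    splits = []
    for i in range(len(cluster) + 1):
        terminal, initial = cluster[:i], cluster[i:]
        if terminal in terminal_clusters and initial in initial_clusters:
            splits.append((terminal, initial))
    return splits

# Hypothetical dictionaries; the empty string means "no consonant on this side".
TERMINAL = {"", "s", "st"}
INITIAL = {"str", "tr", "r"}

splits = split_medial("str", TERMINAL, INITIAL)
print(splits)
```

All three divisions are admissible under these lists, so the dictionary
alone does not decide where the word boundary falls.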

In our treatment of consonant clusters, we considered the treatment
of semivowels. These we found it easiest to deal with by the rules alone --
that is, without the implementation of lists describing particular or even
general occurrences of the semivowel in speech. The rules are in this
case sufficient to account for the elision or insertion of a semivowel.

The Reduced Word Dictionary Method is similar to the consonant
cluster method in that both depend on a thorough and comprehensive
application of the rules of euphonic combination to provide a listing of reduced
forms. Since here we are dealing with whole words, word division is a
primary concern; our rules are of further use since they represent initial
breakthroughs in the treatment of the problem of word juncture. Furthermore,
we are now evaluating a number of techniques to facilitate the
placing of a word division. Particular attention is given to techniques
such as alphabetizing search arguments, either in context or isolated
from context. Protection and verification techniques are also in the
process of final formulation (these latter may be used with either the
consonant cluster or the reduced word approach).

We have not yet come to a final decision about the particular method
we will choose. To do this would require a decision about the kind of
computer machinery we will employ. Relative differences in the amount of
clerical work necessary in the compilation of different dictionaries will
also require scrutiny. And after that, we will have to make tests on
computers to compare time differences in the methods with all their
accompanying techniques.

In our original proposal we wrote: "With the recognizer, however,
the problem is not to discover words (these being known in advance to the
designer), but rather to ensure that borders are included properly in
the machine output as spaces between words. That is, a machine that
operates with acoustical data must make decisions about non-acoustical
phenomena. This problem is no doubt beyond the capability of present
theory, nor does its solution seem especially urgent in the context of
other, more basic considerations. However, its relation to some other
problems may bring it in for cursory study during the proposed research
program." The work above shows that we have gone far beyond this.


[Figure 72. Diagram of Speech Production and Perception Process.
Block labels: Speaker; Dictionary, Grammar & Syntax; Speech; Recognizer.]


SECTION  6:  CONCLUSIONS 


Figure  71  is  a  schematic  diagram  of  the  speech  and  recognition 
process.  Our  study  has  centered  about  the  speech  articulation  phase, 
which  obviously  bears  a  direct  relationship  to  the  nature  of  the  speech 
waveforms.  Since  it  is  the  speech  waveforms  which  comprise  the  input 
data  to  the  perceiver  and  hence  the  formatter,  we  must  understand  the 
possible  and  probable  speech  events  which  occur  in  the  articulation  phase 
before an automatic recognizer can be designed.

We  are  convinced  that  the  acoustic  information  which  can  be 
gathered  from  continuous  articulation  is  more  complex  than  a  mere 
succession  of  phonemes.  We  have  understood  this  complexity  to  consist 
of  slurs,  or  the  incorrect  pronunciation  of  certain  phone  classes  which 
occur  in  the  orthographic  form  of  language. 

To explain this imprecise pronunciation, we have collected a large
body of data which we have organized into more than 500 rules of euphonic
combination. We have found that group theory can be employed to order
these rules according to the degrees of freedom available in the
articulation of speech sounds and to compress this body of rules into 50 rules in
symbolic notation, suitable for computer storage. Such stored information
provides an error-correcting code which can be used to reconcile
imperfectly articulated continuous (and again we emphasize, normal) speech
with orthographic script.

Speech recognizers have been built in the past which assume that
a machine is capable of recognizing the words or phonemes. Most of these
machines (See Figure 18, Section d) have enjoyed only limited success.
In all those designs, the vocabulary has been limited; moreover, when the
phoneme was the segment to be recognized, the single phoneme had to
be articulated in a fixed-consonant environment.

It is desirable to extend the success of these methods beyond their
present limitations; that is, it is desirable to extend the size of the
vocabulary recognized and the environments in which sounds can be recognized.
It has often been possible, though, that by refining the existing methods
one could increase the number of words recognized and so extend the
applicability of the present speech recognizers.

Perhaps, as we have suggested throughout this report, the more
useful approach is not to be found in attempting to develop a speech
recognizer with the limited amount of information which is presently available.
The more useful approach might be to design a method of computer operation
which can anticipate and account for acoustic imprecisions of speech. With
this as a goal, we have examined the nature of continuous ("normal") speech,
in an attempt to ascertain what imprecisions of articulation can be expected.
Our study, we believe, has been conclusive, if not exhaustive; the
significant details of the speech waveform which were examined in
Section 4 clearly demonstrate the validity of our approach. Certainly
extensive proof would require the generation of additional data.

The most outstanding characteristic of continuous speech, and
that which most clearly distinguishes it from discrete speech, is that
in continuous speech sounds modify surrounding sounds, in a continuous
series of events which are neither a multiplication nor an acceleration
of the events of discrete articulation. We have compiled extensive evidence
which clearly demonstrates that: (1) the articulation of words or vowel
sounds in isolation results in waveforms which are significantly different
from the waveforms of the same phone classes spoken in continuous speech;
(2) two or more phone classes tend to be coarticulated (spoken as one
sound), and the coarticulated sound has a waveform which is significantly
different from the waveform of either and/or both the component phone
classes in careful articulation; (3) the word boundaries which are found
in orthographic script are almost totally lost in continuous speech.

It has been our contention that these combinations or modifications
of sounds occur in the English language in a predictable way, which can
be accounted for according to determinate rules; furthermore, these
rules can be related to one another in an orderly fashion. On this basis,
a model can be constructed which is patterned according to the various
dimensions of sounds -- place and manner of articulation, degree of
resonance and aspiration, intensity, duration. Such a model thus would
be called "multidimensional."

Conceivably, this study could have been undertaken by attempting
to collect vast samples of present-day spoken English. We have chosen
instead to begin where more evidence is more readily available. We have
at our disposal, for instance, a dictionary of the English language, which
lists in phonetic symbols the accepted pronunciations of each word. A
large body of knowledge, the result of thousands of years of linguistic
study, is equally available: this linguistic literature thoroughly describes
the sound changes which have occurred in the historical development of
languages (for example, the German d became the English t). By applying
the Ergodic Theory from physics, we were tentatively able to assume
that these historic examples of sound change might provide a basis for
the kinds of sound change which occur in spoken language today, since all
the Proto-Indo-European languages studied utilize the same physical
modes of production. We then proceeded to test the body of Rules for
Euphonic Combination on present-day speech. Certain of these rules found
justification; others were modified or rejected, according to the evidence.




Moreover, the sandhi rules of Sanskrit describe
modifications or substitutions of certain phone classes, when
the phonetic environment is altered. This is much the same
phenomenon as what we have pointed out in our rules of
coarticulation and euphonic combination. In the present study, it
was thought that the speech waveform of certain words in English
might contain phone classes which exist in the spoken language,
but which are incorrect according to the orthographic indications.
Certain examples of this phenomenon have been cited in this
report: for example, bet you often becomes be chyou in continuous
speech.


In the Multidimensional Model the classification of a
particular sound depends upon the degree of freedom which is available
in the physical process of sound production. Sounds which are
"adjacent" in the model are sounds which are produced almost
identically. As indicated on the diagram of the model included
as part of Appendix B, the horizontal axis represents the degree
of aspiration or resonance, the vertical axis represents the place
of articulation, and the depth axis represents the manner of
articulation. With such a method for ordering speech sounds,
we can conceive of computer programming which depends
conceptually on the perceiver to replace unacceptable phone classes with
the phone class whose waveform characteristics are nearest to
the "incorrect" class which was presented. Thus in the case of
bet you becoming be chyou, it is recorded as a rule of euphonic
combination that the alveolar stop t becomes the palatal affricate
ch before the semi-vowel y. This rule can be used in connection
with the model representation to anticipate and correct imprecise
articulation, as is found in continuous speech.
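
The idea of replacing an unacceptable phone class with its nearest
neighbor in the model can be sketched as follows. The coordinates are
assumptions made only for this illustration: columns, rows and layers are
numbered in the order in which they appear in the charts of Appendix B,
and the city-block distance is our own choice of measure, not one stated
in this report.

```python
# Hypothetical coordinates: (aspiration/resonance column, place row, manner layer).
# Columns: 0 voiceless unaspirated ... 2 voiced unaspirated.
# Rows: 0 guttural, 1 palatal, 2 alveolar, 3 dental, 4 labiodental, 5 labial.
# Layers: 0 stops and nasals, 1 sibilants, 2 affricates.
PHONES = {
    "t": (0, 2, 0),
    "d": (2, 2, 0),
    "p": (0, 5, 0),
    "ch": (0, 1, 2),
}

def distance(a, b):
    """City-block distance between two positions in the model (an assumption)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def nearest_acceptable(presented, acceptable):
    """Return the acceptable phone class closest to the one presented."""
    origin = PHONES[presented]
    return min(acceptable, key=lambda p: (distance(PHONES[p], origin), p))

choice = nearest_acceptable("ch", ["t", "d", "p"])
print(choice)
```

With these assumed positions the affricate ch heard in be chyou is mapped
back to t, the stop that the orthographic form bet you requires.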

Furthermore, the segments which we have described are
much more flexible than any previously mentioned by other researchers.
The use of a CV combination as a basic acoustic segment allows
both for individual variations in the pronunciation of certain phone
classes, and also the acoustic variations which result from the
phonetic environment of a given phone class.

Clearly such a system as we outline above places less
restriction on the person who uses this machine: he is no longer
limited to the number of phonemes or the environment which
can be allowed; moreover, the articulation need not be strained --
normal speech can be correctly perceived and printed.




APPENDIX  A 


These data are a very small portion of a corpus of words
and sentences transcribed from the speech of a native speaker of
Vietnamese. We believe that the transcription is accurate and the
data are sufficiently complete for this analysis. We do not, however,
know what dialect the informant spoke, and it is possible that this
analysis is not valid for other dialects.

The following list of words shows all the stop consonants which
occur in final position in this dialect. All final stops are unreleased;
that is to say, there is no stop burst or aspiration. The symbol [ᵏp]
represents a consonant articulated with two complete simultaneous
closures. One is at the back of the mouth where [k] is articulated
and the other is at the lips where [p] is articulated. The symbol [ɑ]
represents a low back unrounded vowel; the symbol [o] represents
a somewhat diphthongized mid-back rounded vowel. The symbol [:]
means that the preceding vowel is long.

I> 

i  pop 

I  1  -t(  ! 

t  -iM 

|)  t  pop 

The above list shows that we have four phonetically different
final stops; the problem is to decide how many different phonemes
there are. This means we must establish which phonetic differences
are relevant (i.e., word-differentiating) in this language. The phonetic
differences are:

(1) The difference between [p] and [t]

(2) The difference between [p] and [k]

(3) The difference between [p] and [ᵏp]

(4) The difference between [t] and [k]

(5) The difference between [t] and [ᵏp]

(6) The difference between [k] and [ᵏp]

The best method of establishing the fact that the differences
between two phone classes are relevant is to find a minimal pair. A
minimal pair consists of two words which are identical in every
speech sound except one. If a native speaker says that the two members
of this pair sound different then the phones which are different in the
two words belong to different phonemes.
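
The search for minimal pairs is mechanical enough to sketch. The word
list below is hypothetical and uses English-like forms rather than the
Vietnamese data; each word is written as a tuple of segments so that a
segment may be spelled with more than one letter.

```python
from itertools import combinations

def minimal_pairs(words):
    """Return pairs of equal-length words that differ in exactly one segment."""
    pairs = []
    for a, b in combinations(words, 2):
        if len(a) != len(b):
            continue
        diffs = [(x, y) for x, y in zip(a, b) if x != y]
        if len(diffs) == 1:
            pairs.append((a, b, diffs[0]))
    return pairs

# Hypothetical transcriptions (not from the corpus described in this appendix).
WORDS = [("p", "i", "n"), ("b", "i", "n"), ("p", "a", "t")]

pairs = minimal_pairs(WORDS)
print(pairs)
```

Only the first two words form a minimal pair, and the differing segments
(p and b) are the ones that would be assigned to different phonemes if a
native speaker hears the two words as different.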

In the above list, we have only one minimal pair -- [thop] and
[thoᵏp]. The existence of this pair tells us that [p] and [ᵏp] do not
belong to the same phoneme.

Although there are no other minimal pairs, there are some near-
minimal pairs. A near-minimal pair is a pair which is identical
in some segments and different in others. In using a near-minimal
pair to do the phonemic analysis, we make the assumption that the
differences between the final stops of two Vietnamese words are
independent of the differences between the initial consonants of
these words. There is always some risk involved in making such an
assumption, but if we do not make it, we cannot continue the analysis.

The assumption is bolstered by the fact that no language has yet
been analyzed in which differences between final consonants depend
on differences between initial consonants. We will assume, then,
that the differences between the [t] of [ʃɑt] and the [k] of [tyɑk]
are not dependent on the differences between the initial consonants
[ʃ] and [ty]. We decide that [t] and [k] belong to different phonemes
because they appear in the same position, in final position after the
vowel [ɑ].

In order to compare the final stop of [fɑ:p] with those of
[ʃɑt] and [tyɑk], we must make the further assumption that the
differences between the final stop of [fɑ:p] and those of [ʃɑt] and
[tyɑk] are not dependent on the length of the preceding vowel. Again,
we would rather not make assumptions like this, but there is no
help for it. Having made this assumption, we compare the final
stop of [fɑ:p] with that of [ʃɑt] and conclude that they belong to
different phonemes. Likewise we compare the final stop of [fɑ:p]
with that of [tyɑk] and conclude that they belong to different phonemes.

Out of the six possible comparisons which we listed earlier,
we have carried out four. We have established that the following
pairs of stops cannot belong to the same phoneme.

(1) [p] and [ᵏp]

(2) [p] and [t]

(3) [p] and [k]

(4) [t] and [k]

The fact that [p] contrasts with [t] and [k], and [t] and [k]
contrast with each other, forces us to conclude that there are three
separate phonemes, /p/, /t/, /k/. The fact that [p] contrasts
with [ᵏp] means that [ᵏp] cannot belong to the /p/ phoneme; there
remain three possible analyses for [ᵏp].

(1) It belongs to neither /p/, /t/, nor /k/. It belongs to
a phoneme by itself.

(2) It belongs to /t/.

(3) It belongs to /k/.

Analysis (1) is to be avoided if possible because we prefer
not to set up more phonemes than we need to account for all the
contrasts of the language. [ᵏp] does not contrast with [t] or [k]
since [ᵏp] occurs only after rounded vowels while [t] and [k] occur
only after unrounded vowels. We therefore reject analysis (1).

This leaves analyses (2) and (3). Given the choice of grouping
[ᵏp] with [t] or grouping it with [k] we do not hesitate to group it
with [k] since phonetically it has more in common with [k] than
with [t].
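
The argument from distribution used above can be restated as a small
check. The word list is hypothetical and merely imitates the pattern
described in this appendix; "kp" stands for the doubly articulated stop,
and the set of rounded vowels is an assumption of the sketch.

```python
def final_environments(words, phone, rounded_vowels):
    """Collect whether the vowel before each final occurrence of phone is rounded."""
    return {w[-2] in rounded_vowels for w in words if w[-1] == phone}

def complementary(words, a, b, rounded_vowels):
    """True when final a and final b never share a rounding environment."""
    env_a = final_environments(words, a, rounded_vowels)
    env_b = final_environments(words, b, rounded_vowels)
    return env_a.isdisjoint(env_b)

# Hypothetical words as tuples of segments; the last segment is the final stop.
WORDS = [
    ("th", "o", "kp"),
    ("th", "o", "p"),
    ("f", "a", "p"),
    ("sh", "a", "t"),
    ("ty", "a", "k"),
]
ROUNDED = {"o", "u"}

print(complementary(WORDS, "kp", "k", ROUNDED))
print(complementary(WORDS, "kp", "p", ROUNDED))
```

Two stops that never occur in the same environment cannot contrast, so
they are candidates for a single phoneme; two stops that do share an
environment, as kp and p do in this list, must be kept apart.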

It may be asked why we were willing to assume that the final
consonant was not affected by the initial consonants or by the length
of the preceding vowel, but we were willing to assume it was
affected by the preceding vowel's being rounded rather than unrounded.
This is a reasonable objection, and the answer lies in considering the
phonetic details carefully.

The initial consonant is not adjacent to the final consonant
and although non-adjacent vowels sometimes influence each other
directly (i.e., without changing any intervening sound), non-adjacent
consonants rarely do so. This is a generalization which we believe
holds true for all languages.




As for the final consonant being affected by the duration of
the preceding vowel, this does happen in language, but the common
effect is some change in the duration of the consonant. In the extreme
cases, the consonant is dropped completely. We know of no cases,
however, where the place of articulation of a consonant has changed
audibly because the preceding vowel was phonemically long.

When we consider the effect of a rounded vowel or semivowel on
an adjacent consonant, however, the situation is quite different.
Such influences are common, and we know of one case in which a
k followed by w became pp. This change took place in very early
Greek, when the Proto-Indo-European word for "horse" became the
Greek hippos, but remained almost unchanged in the Latin equus
(pronounced ekwus).

We are citing this historical example, not to establish the
origin of the Vietnamese [ᵏp], but to show that there is phonetic
similarity between a p and a vowel or semivowel which involves
lip-rounding. (We are here making the assumption that if one sound
has been substituted for another in any language, there must be
some point of phonetic similarity between the original sound and
the substituted sound.) We are not concerned with how the [ᵏp] came
into being, but how it functions in the language.




APPENDIX  B 


-  CHARTS  OF  THE  CONSONANT  CATEGORIES 




Consonants shown as
Riemann leaves of the
plane on following pages


STOPS  AND  NASALS 


Guttural 

Palatal 

Alveolar 


Dental 

T 

t* 

d 

d» 

n 

Fr.  tiena 

Ger,  du 

neun 

Labiodental 

P 

b 

b» 

m 

P 

> 

•> 

7 

Labial 

n 

r»l 

1, 

t 

SIBILANTS 


Voiceless  Voiced 

Unaspirated  Aspirated  Unaspirated  Aspirated 


Voiced  with  nasal 
resonance 


Guttural 

Palatal  J 

Alveolar  s 

Dental  s 

Labiodental 

Labial 




AFFRICATES 


Voiceless  Voiced 

Unaspirated  Aspirated  Unaspirated  Aspirated 

Guttural 

j 

Palatal  c 

chin 

Alveolar 

Dental 

Labiodental 
Labial 


Voiced  with  nasal 
resonance 




SPIRANTS 


Voiceless  Voiced 

Unaspirated  Aspirated  Unaspirated  Aspirated 

Guttural  ^ 

J 

Palatal 


Alveolar  |>  •g 

Dental  t 

thin  this 

Labiodental  f  v 

fine  vine 


Labial  ^ 


Voiced  with  nasal 
resonance 




Guttural 

Palatal 

Alveolar 

Dental 

Labiodental 

Labial 


LATERALS 


Voiceless  Voiced 

Unaspirated  Aspirated  Unaspirated  Aspirated


Voiced  with  nasal 
resonance 


L 

bulk 


L 

dull 


1 

Ut 


1 

4. 

well 




APPENDIX  C 


In making a palatogram a plate is shaped so that it conforms
to the contours of the roof of the mouth. This is then coated with
a substance which changes appearance when it is touched by the tongue.
After this has been fitted into the subject's mouth, he articulates the
sound which is under investigation, and the plate is immediately
removed. By examining the plate and determining just where its
appearance has changed, we can establish which parts of the tongue
make contact with the roof of the mouth during the articulation of
the sound under study.

The palatograms in the text are illustrations of what we
believe the originals to be like, rather than original plates made
by us.




APPENDIX  D 

Before giving Meyer's conclusions, we will list the tense and
lax phones; this is as close as we can come to defining the terms.

The "tense" consonants include all voiceless consonants; the
lax consonants include all voiced consonants except the liquids and
nasals. The liquids and nasals are neither tense nor lax. The tense
vowels include the vowels in the following words: wife, way, leaf,
lose, lobe, and cloud. The lax vowels include the vowels of the
following words: if, loss, lose, gas, push, should, and bud. The
list of tense vowels is incomplete; Meyer gives all his examples in the
phonetic script used sixty years ago. Most of the words are
recognizable, but a few are not. There are three tense vowels which we
have not listed. Some of Meyer's conclusions apply to specific
segments of the speech wave as we have divided it, but many do not.
Those of his conclusions which refer to only one type of segment are
given in the sections in which those segments are described. Those
which group together two or more of our segments are as follows:

a. Consonant durations

(1) The duration of initial lax consonants in one- and two-
syllable words is slightly shorter than the duration of initial
tense consonants. The difference is greater for consonants in
medial and final position.

(2) Apparently the duration of an initial consonant does not
depend on the quality of the following vowel.

(3) Initial consonants in two-syllable words are slightly shorter
than in one-syllable words; medial and final consonants in two-
syllable words are significantly shorter than in one-syllable words.



(4) The duration of a final consonant is dependent on the quality
of the preceding vowel: the higher the tongue position for the
vowel, the longer the final consonant.

b. Vowel durations

(1) A lax vowel is shorter than a tense vowel.

(2) The higher the tongue-position, the shorter the vowel.

(3) A vowel before a tense final consonant is shorter than a
vowel before a lax final consonant.

(4) A vowel before a stop is shorter than a vowel before a fricative.

(5) l, m, n, and ŋ tend to shorten the preceding vowel.

(6) The lengthening of a lax vowel under influence of the final
consonant is slightly less for a naturally long vowel than for a
naturally short one.

(7) The lengthening of a tense vowel under influence of the
following consonant is considerably less for a naturally long
vowel than for a naturally short one.

(8) The different vowel durations before different consonants
cannot be explained as an attempt to keep the syllable duration
or the rhythm constant.

(9) The duration of the stressed vowel in a two-syllable word
is considerably shorter than the duration of the stressed vowel in
a one-syllable word.

(10) A vowel before a tense medial consonant is shorter than
a vowel before a lax medial consonant.

(11) The unstressed vowel of a two-syllable word is long.

(12) A vowel before a fricative (spirant, sibilant, [w], or [h])
is longer than a vowel before a stop.


Meyer's conclusions are not directly applicable to the present
model, both because Meyer did not break the speech wave down enough
for our purposes and because he analyzed British English. It is not
enough to know that a vowel has been lengthened; we need to know which
parts of the vowel have been affected.


APPENDIX  E 


i)  Duration  of  Nasals 

Harrell reports that when a tape recording of the word mump is
played backwards, the resulting combination is still heard as
mump (Harrell, 1958). He explains this by suggesting that an initial
nasal is considerably shorter than a final nasal (at least in English),
except in special circumstances.

In this case, however, mump is pronounced with an unreleased
p, in which the closure is not followed by a noise burst (as in a quick
pronunciation of rump rather than oompah). According to Harrell's
hypothesis, the only thing which makes an apparently final nasal as
short as an initial one is a following voiceless unreleased
stop, as in mump. In this special case the stop is articulated in the
same place as the nasal; that is, it is "homorganic" with the nasal.

Since the p in such words as bump is frequently unreleased,
it may be necessary to instruct a machine for speech recognition
that an apparently final nasal which is no longer than an initial nasal
should be interpreted as a combination of nasal plus homorganic
voiceless stop.
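
The instruction to the machine suggested above can be sketched as a simple rule. This is only an illustration under stated assumptions: the function name, the symbol spellings, and the nasal-to-stop table are hypothetical, and the reference duration of an initial nasal would have to come from measurement.

```python
# Hypothetical sketch; the symbol names and this table are assumptions
# made for illustration, not part of the report.
HOMORGANIC_VOICELESS_STOP = {"m": "p", "n": "t", "ng": "k"}


def interpret_final_nasal(nasal, final_duration, initial_duration):
    """Interpret an apparently final nasal from its duration.

    final_duration: measured duration of the apparently final nasal.
    initial_duration: reference duration of an initial nasal (same units).
    """
    if final_duration <= initial_duration:
        # No longer than an initial nasal: assume that an unreleased
        # homorganic voiceless stop follows (as in bump or rump).
        return [nasal, HOMORGANIC_VOICELESS_STOP[nasal]]
    return [nasal]
```

A short final m would thus be read as m plus p, while a long final m would be left as a plain nasal.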

Meyer reports (Meyer, 1903) that a final nasal is approximately
one and one half times as long as an initial nasal. Ilse
Lehiste, however, reports data which appear to contradict this (Lehiste,
1960). By comparing spectrograms she performed experiments
to discover characteristics of the sound wave that accompany what
is known as juncture. (The difference between an ice man and a nice
man is that an ice man has a juncture after n, and a nice man has a
juncture before n.) According to Lehiste, the initial n of nice in the
phrase a nice man is twice as long as the final n of an in the phrase
an ice man. She suggests that this difference in the duration of n is
an important cue for distinguishing these two phrases. That is, an
initial n is recognized as being initial because it is longer. Meyer
reports also that the n in an aim is shorter than the n in a name,
but he does not comment on the fact that this seems to contradict
his statement that final nasals are longer than initial ones.

This is a very complex problem and it requires further
research. The answer may be that the word an in both these examples
is completely unstressed while the words nice and name are both
strongly stressed. The initial n's may have been lengthened because
they are in the stressed syllable. This is the only hypothesis which
occurs to us at present.

ii) Relative Duration of the Onglide, Steady-State and Offglide of
Liquids and Semi-Vowels

Working with the Pattern Playback, Lisker attempted to
make the machine produce intervocalic r, l, and w artificially.
(Intervocalic means occurring between vowels; the actual sounds Lisker
tried to reproduce were iri, ara, uru, iyi, aya, uyu, etc.)

In synthesizing artificial r, y, and w, Lisker discovered that
the most recognizable sounds were created when he drew his
spectrogram so that the onglide, steady-state, and offglide were of equal
duration. In synthesizing intervocalic l, however, the most natural
sound occurred when the onglide and offglide were drawn slightly
shorter than the steady-state (Lisker, 1957).


In another experiment, work done at Haskins on synthesizing
initial liquids and semi-vowels yielded the following information.
The quality of the synthetic l was improved when the first-formant
transition was made very short; r, y, and w were not adversely
affected by having the first-formant transition short (O'Connor,
Gerstman, Liberman, Delattre, and Cooper, 1957).

iii)  Duration  of  Spirants  and  Sibilants 

Experimenting with the words use [yus] (noun) and use [yuz]
(verb), Denes made tape recordings of this word-pair and established
that the vowel preceding [z] was considerably longer than the vowel
preceding [s] (Denes, 1955).

The next step was to take the segment [s], shorten it, and
put it after the vowel of [yuz]; and also to take the [z], lengthen it,
and put it after the vowel of [yus]. He reports that both combinations
sounded like perfectly normal words. This would indicate that the
duration of a sibilant is an important cue for identifying it as "voiced"
or "voiceless".

Meyer's figures for sounds he defines as tense [f, (symbol illegible), s]
compared with sounds he defines as lax [v, (remainder illegible)] show
that tense spirants and sibilants are longer than lax ones. They also
show that [f, s, (symbol illegible), v, and z] are slightly longer after
lax vowels than after tense ones, while one further spirant (symbol
illegible) is considerably longer after lax vowels. He also reports that
fricatives (i.e. spirants, sibilants, [h], and [w]) in general have a
greater duration than stop closures.

iv)  Duration  of  Stop  Closures 

Comparing spectrograms, Eli Fischer-Jørgensen reports that
in Danish the closure of p, t, k is shorter than that of b, d, g
(Fischer-Jørgensen, 1954). Other work contradicts this, however. Leigh
Lisker has recorded words with intervocalic b (as in rabid). He
then cut out that section of tape which had the stop-gap on it. In its
place he inserted a period of silence longer than the stop-gap of [b].
As a result the word was heard as having a [p]; rabid became rapid.
Conversely, if a tape of [p] has its stop-gap cut out and a period of
silence shorter than the stop-gap of [p] is inserted, a [b] is heard;
rapid is changed back to rabid (Lisker, 1957).

It should be noted that this study was confined to stops
between vowels. Whether similar results would be obtained for stops
at the beginning or end of words is not known. The discrepancy
between Lisker's findings and those of Fischer-Jørgensen also requires
investigation. Since one study was made for Danish and one for
English it is quite possible that both are valid.
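
Lisker's tape-splicing result for English can be put as a one-line decision on the length of the silent gap. The sketch below is only illustrative: the function name is hypothetical, and the boundary duration is left as a parameter because the text gives no figure for it.

```python
def perceived_intervocalic_stop(gap, boundary):
    """Guess which stop a listener hears between vowels.

    gap: duration of the silent stop-gap.
    boundary: assumed dividing duration between [b] and [p] (same units);
              the report gives no value, so the caller must supply one.
    """
    if gap <= 0 or boundary <= 0:
        raise ValueError("durations must be positive")
    # A longer silence is heard as [p] (rapid), a shorter one as [b] (rabid).
    return "p" if gap > boundary else "b"
```

As the text notes, this applies only to stops between vowels in English; the Danish data point the other way.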

Meyer reports that stop closures are shorter than the class
of sounds which he calls fricatives (i.e. spirants, sibilants, [w],
and [h]). He also reports that the closure of p, b is greatest, that
of k, g next, and that of t, d least, and that the closures p, t, k show more
variation in duration than any other class of speech sounds.

v)  The  Duration  of  Stop  Bursts 

There is some evidence that duration differs from stop to
stop in other languages. Eli Fischer-Jørgensen (Fischer-Jørgensen,
1954), reporting on Danish stops, says g has a greater duration than
d, which in turn has a greater duration than b. Similarly k is
greater than t, which is greater than p. It should be noted that
in Danish b, d, g are sometimes voiceless and p, t, k are sometimes
voiced. This evidence may or may not be relevant to a study of English.


In producing stop consonants by use of machines, Haskins
Laboratories' reports indicate that the duration of the synthetic burst
was 0.015 seconds (Cooper, Delattre, Liberman, Borst, and
Gerstman, 1952).

vi) Duration Between the Stop Burst and the Beginning of the First
Formant

Using the Pattern Playback, Haskins Laboratories reports
that a synthetic speech pattern which listeners perceive as voiced
stop plus vowel, such as bah, can be changed to one which listeners
perceive as voiceless stop plus vowel, such as pa, simply by cutting
off the beginning of the first formant (Liberman, Delattre, and
Cooper, 1958). The authors point out that since the voiceless
stops in English are aspirated (pronounced with a half-heard h,
as in pat) while voiced stops are not (as in bat), the time-lag between
the stop burst and the very beginning of the first formant is probably
thought to be a period of aspiration; thus the stops are heard as
voiceless.

vii) Duration of Vowel Onglide, Offglide and Steady-State
a. Onglide

Again using their artificial speech machine, Haskins
reports (Liberman, Delattre, Gerstman, and Cooper, 1956)
that a pattern which is perceived as [bɛ], as in bet, changes to
one perceived as [wɛ], as in wet, when the duration of the onglide
or transition is increased. If the duration of the onglide is
increased still further, the result is [uɛ] (oo-e). (It should be noted
that the pattern which produced [bɛ] in this experiment consisted
of formant transitions followed by steady-state. There was a voice-bar
for the voiced stops, but no stop burst.) The pattern for [gɛ], as in
get, similarly yielded [yɛ], as in yet, and [iɛ] when the
duration of the onglide (transition) was increased. [bɛ], as in bet, was
transformed to [wɛ], as in wet, when the duration of the transition
exceeded 40 milliseconds; [gɛ], as in get, became [yɛ], as in yet,
at fifty to sixty milliseconds.
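
The progression reported above, from stop to semi-vowel to vowel as the transition is lengthened, can be sketched as a three-way decision. The function and parameter names are hypothetical, and both limits are left to the caller because they differ from one place of articulation to another and are only partly legible in the source.

```python
def perceived_onset(transition_ms, stop_limit_ms, semivowel_limit_ms):
    """Classify a synthetic onset by the duration of its formant transition.

    stop_limit_ms: longest transition still heard as a stop (assumed input).
    semivowel_limit_ms: longest transition still heard as a semi-vowel
                        (assumed input; no figure is given in the text).
    """
    if not 0 < stop_limit_ms < semivowel_limit_ms:
        raise ValueError("limits must satisfy 0 < stop < semivowel")
    if transition_ms <= stop_limit_ms:
        return "stop"        # e.g. the b of bet
    if transition_ms <= semivowel_limit_ms:
        return "semivowel"   # e.g. the w of wet
    return "vowel"           # e.g. oo followed by e
```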
b. Steady-State

[The text of this subsection is illegible in the source copy. The
fragments that can be read mention the duration of the vowel
steady-state, spectrograms in Visible Speech by Potter and others,
and the speech of Southerners.]


c. Offglide

Lehiste and Peterson report (Lehiste and Peterson 2, 1960)
that American English vowels may be divided into two groups
using the criterion of the relative durations of steady-state and
offglide. One group, which they call the tense vowels, consists
of i, æ, a, ɔ, u; the other group, which they call the lax vowels,
consists of ɪ, ɛ, ə, ʊ. The tense vowels have an offglide approximately
half as long as the steady-state; the lax vowels have an offglide
more than one and one-fourth times as long as the steady-state. It
should be noted that all the subjects who were used for this study
spoke the same dialect.
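
The two ratios reported by Lehiste and Peterson suggest a simple test on the ratio of offglide to steady-state. The sketch below is only illustrative: the function name is hypothetical, and the default boundary of 0.875 is an assumption (the midpoint between the reported values of about 0.5 and 1.25), not a figure from the study.

```python
def vowel_group(steady_state, offglide, boundary_ratio=0.875):
    """Assign a vowel to the tense or lax group from two durations.

    steady_state, offglide: durations in the same units.
    boundary_ratio: assumed dividing value for offglide / steady_state.
    """
    if steady_state <= 0 or offglide < 0:
        raise ValueError("steady_state must be positive, offglide non-negative")
    ratio = offglide / steady_state
    # Tense vowels: offglide about half the steady-state.
    # Lax vowels: offglide more than 1.25 times the steady-state.
    return "lax" if ratio > boundary_ratio else "tense"
```

Since all subjects spoke one dialect, any such boundary would need to be checked again for other dialects.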


APPENDIX  F 


The first method of reconstruction differs from the others in that
it does not require any data except the language itself, considered at
one point in time. For this reason it is called internal reconstruction.
We have already cited the alternation of əd, d, and t as
past-tense markers in English. We have said that this is the result
of two sound-changes, one in which əd became d, and one in
which d became t. We know that these two changes took place
because we have eighteenth-century speech manuals which warn their
readers not to omit the vowel of the past-tense suffix. Even if we
did not have these manuals, however, we could still reconstruct
part of the change by considering the nature of the alternation.

In any reconstruction we begin by assuming that one of the
alternating sounds is the original one. If we cannot arrive at
linguistically probable results by this approach, we then assume that
all of the alternating sounds are innovations. In this English example,
then, we will assume that əd, d, or t was the original past-tense
suffix. If we say that t was the original suffix, we must
explain why t became d after the vowel e in laid, but remained
unchanged after the same vowel in late. If t was the original
suffix, laid and late were homonyms, and laid underwent a phonetic
change while late remained unchanged. This conflicts with our
basic principle that sound-change is regular, so we must reject
the assumption that t was the original suffix.

This leaves us with the assumption that the original suffix
was əd or d. If we assume that it was d, we also assume that
for at least a brief period of time, speakers of the language consistently
pronounced the cluster td in the past tense of the verb taste. This is
possible, but highly improbable. It is very doubtful that the original
suffix was d.

This leaves us with the assumption that əd was the original
form. There are no difficulties involved in this assumption. The
vowel could drop out quite easily after all sounds except d or t.
The loss of the vowel would be a simplification of the articulatory
movements, and many sound changes are simplifications of this type.
After the vowel loss (or simultaneously with it) the d became t when
it was next to a voiceless consonant. This sound change is also a
simplification. The hypothesis that əd was the original suffix
involves our assuming only sound changes which are probable. The
assumption that d was the original suffix involves our assuming a
highly improbable state of affairs before the change. The assumption
that t was the original suffix involves our assuming a sound change
which contradicts one of our basic postulates. We therefore conclude
that əd was the original suffix.

The forms əd, d, and t stand in a special relationship to
each other. They are different phonetic forms of the same meaningful
element, the suffix for the past tense; which of them will appear
depends on the phonetic shape of the last sound in the verb. This
relationship is called morphophonemic alternation. There are other
cases of this type of alternation in English; one of them is the plural
suffix, which is əz after a sibilant or affricate (as in glasses), z
after a vowel or voiced consonant (as in chairs), and s after a voiceless
consonant (as in books). Most morphophonemic alternations are
entirely the result of sound changes; some are partly the result of
changes by analogy. The difference between sound change and
morphophonemic alternation is that sound change is a process which
takes place over a period of time, while morphophonemic alternation
is a situation which exists at one time in the language. A machine
for automatic speech recognition will need both a list of the
morphophonemic alternations of that particular language and a list
of sound changes which have taken place in any language. The list
of morphophonemic alternations would make it unnecessary to
make a separate statement about which alternate appears with each
word. The list of sound changes will predict the normal sound
variations of speech.
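
One entry in such a list of morphophonemic alternations might be stored as a rule keyed to the last sound of the stem rather than as a statement for each word. The sketch below is only illustrative: the function names, the ASCII spellings of the sounds ("sh", "ch", and so on), and the two sound classes are assumptions, and the suffix forms are written here as əd, d, t and əz, z, s.

```python
# Hypothetical sound classes; spellings are illustrative, not a phonetic standard.
VOICELESS = {"p", "t", "k", "f", "th", "s", "sh", "ch"}
SIBILANTS_AND_AFFRICATES = {"s", "z", "sh", "zh", "ch", "j"}


def past_tense_suffix(last_sound):
    """Choose the past-tense alternant from the last sound of the verb."""
    if last_sound in {"t", "d"}:
        return "əd"   # vowel kept after t or d (tasted)
    if last_sound in VOICELESS:
        return "t"    # next to a voiceless consonant (looked)
    return "d"        # after a vowel or voiced consonant (laid)


def plural_suffix(last_sound):
    """Choose the plural alternant from the last sound of the noun."""
    if last_sound in SIBILANTS_AND_AFFRICATES:
        return "əz"   # glasses
    if last_sound in VOICELESS:
        return "s"    # books
    return "z"        # chairs
```

The order of the tests matters: t and d, and the sibilants, must be checked before the general voiceless class.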

The second method of establishing what sound changes have
taken place is to compare descriptions of the same language made
at different times. For this comparison, we use only descriptions
which are contemporary with the speech being described. In general
the descriptions which we have were made for one of two reasons. The
writer was either giving instructions on how to speak like a well-educated
man or he was demonstrating the necessity for a spelling
reform. Most of our descriptions of Latin, Greek, and eighteenth-century
English belong to the first category, while our best description
of Old Icelandic belongs to the second. Both categories have
particular drawbacks. The authors of pronunciation manuals sometimes
make up rules which have never before existed, while the authors
who recommend a spelling reform are primarily concerned with
having a distinctive spelling for each phonetically distinct word. Their
goal is not to record all phonetic differences, but all phonemic
differences. Any statement which a writer makes about the pronunciation
of his language must be carefully checked against the written records
of that language, but these statements are nevertheless very valuable
for indicating what sound changes have taken place.

The third source of information about sound changes is written
records. Although far better than nothing, these also have their
drawbacks. One problem is that it is frequently difficult to determine
what particular sound a given symbol or group of symbols is supposed
to represent. We have very strong evidence for assuming that Indo-European
had an s and that Old Icelandic had an r in place of that s
in some phonetic environments. This means that s became r sometime
between Proto-Indo-European and Old Icelandic. We have runic
inscriptions from the period when this phonetic change was still
taking place, but we do not know exactly what sounds the runes
represent. There are three runes in question here. One occurs in
those places where s did not become r; the second appears in those places
in which s did become r; the third appears in those places where
we assume that there was an r in Indo-European which remained in
Old Icelandic. If we knew what sound the rune for "s becoming r"
represented, we would know the phonetic stages of this very common
sound change, but all we can say definitely is that there were three
separate sounds in Proto-Norse (the language of the Scandinavian
runic inscriptions). Most phoneticians assume that when s becomes r,
the intermediate stage is z, and this seems probable, but we cannot
prove it from written records.

The second drawback to written records is that spelling is
usually standardized and does not necessarily reflect contemporary
pronunciation. It would be extremely difficult, probably impossible, to
reconstruct the pronunciation of Modern English using only written
records other than dictionaries and speech manuals.

In analyzing written records, we can obtain valuable information
by paying careful attention to non-standard spellings (misspellings).
A non-standard spelling is sometimes closer to the phonetic reality
than the standard is. The incorrect nee reflects pronunciation more
accurately than the correct knee does.

The fourth source of information about sound changes is the
comparison of modern dialects of the same language. When two
dialects show phonetic differences, it is obvious that one or both
have undergone sound changes. One criterion for deciding what
changes have taken place is simplicity. If two or more analyses seem
equally probable phonetically, we assume the correctness of that one
which involves the smallest number of changes.
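
The simplicity criterion amounts to choosing, among analyses already judged equally probable, the one with the fewest posited changes. A minimal sketch follows; the function name and the way an analysis is represented (a label mapped to a list of changes) are assumptions made for illustration.

```python
def simplest_analysis(analyses):
    """Return the label of the analysis that posits the fewest sound changes.

    analyses: dict mapping a label to the list of sound changes it assumes.
    All analyses passed in are taken to be equally probable phonetically;
    if two are tied, the one listed first is returned.
    """
    if not analyses:
        raise ValueError("at least one analysis is required")
    return min(analyses, key=lambda label: len(analyses[label]))
```

For two competing accounts of a dialect difference, one needing two changes and the other three, the function returns the label of the two-change account.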


The fifth method for discovering sound changes is comparing
the earliest written records of two or more related languages for
the purpose of reconstructing the parent language. The most extensive
reconstruction of this type which has been done so far is the
reconstruction of Proto-Indo-European by comparing written records of
Greek, Latin, Sanskrit, Old Church Slavic, Hittite, Old Irish, and
several other ancient languages. The phonetic accuracy of the
reconstructed Proto-Indo-European forms is open to question. Some
linguists say that these reconstructed elements should not be considered
phonetic representations at all, but simply formulae for referring to
the sound-correspondences of the later languages. According to
this approach, "Proto-Indo-European p" is not the phonetic symbol
p, but simply a formula for referring to that sound in Proto-Indo-European
which became p in Greek, Latin, and Sanskrit, became f
in Germanic, disappeared completely in Celtic, etc. Most linguists
do not go quite this far, but there is general agreement that all
Proto-Indo-European reconstructions should be critically analyzed in the
light of phonetic probability. All methods of reconstructing sound
change involve some possibility of inaccuracy, since there is no
substitute for direct observation, and our reconstructions of
Proto-Indo-European are especially likely to contain errors, since they are
made from written records of languages which are now dead.


APPENDIX  G 


Martinet views linguistic evolution as something which is
regulated by the continual conflict between man's expressive needs
and his tendency toward minimal mental and physical exertion
(Martinet, 1952). The "evolution" is the result of the changes in
expressive needs which occur over a period of time. Martinet does
not attempt to analyze in detail the changes in expressive needs.
The principal effect of the expressive needs according to him is that
the speaker strives for clarity. People strive to speak in such a
manner that their enunciations can be understood without repetition.
If they are not clear enough the first time, and the listeners ask for
a repetition or an explanation, the speakers will be even more careful
the second time. It is this process which results in some measure of
uniformity in the speech of people. The striving for clarity is a
clear-cut factor which can prevent some sound changes.

The tendency to reduce mental and physical exertions to a minimum
provides a more complicated problem in determining sound change; any
change which reduces one type of exertion is likely to increase another.
The extreme of articulatory simplicity would be to have two
distinctive speech sounds (phonemes), one a vowel and one a consonant.
All words in the language would consist of some arrangement of these
two sounds in a series similar to that used by binary computers. The
number of permitted phonetic variations of each phoneme would be
extremely large, and this would save the speaker the trouble of
having to articulate carefully. On the other hand, the words of this
language would be excessively long, and any utterance would require
a great deal of time and effort. The other extreme would be a language
with as many distinctively different sounds as the human ear can
perceive. Every sound would have to be articulated very carefully, but
it would be possible to have very short words and utterances. Neither of
these extreme cases exists in any natural language. In actual practice
all languages require some precision, but none use more than a small
fraction of all possible phonetic distinctions. This compromise requires
less exertion from the users of the language than either of the extreme
situations outlined above.
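
The trade-off between the number of phonemes and the length of words can be made concrete with a little arithmetic: with k phonemes there are k to the power n possible strings of length n. The sketch below is only illustrative. It assumes that every string of phonemes is a permissible word and that all words have the same length, and the vocabulary and inventory sizes in the usage note are hypothetical.

```python
def min_word_length(vocabulary_size, phoneme_count):
    """Smallest length n such that phoneme_count ** n >= vocabulary_size."""
    if vocabulary_size < 1 or phoneme_count < 2:
        raise ValueError("need at least one word and two phonemes")
    length = 0
    capacity = 1  # number of distinct strings of the current length
    while capacity < vocabulary_size:
        capacity *= phoneme_count
        length += 1
    return length
```

Under these assumptions a vocabulary of 10,000 words needs words of 14 sounds in a two-phoneme language, but only 3 sounds with an inventory of 40 phonemes.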

Exertion is further reduced by combining several distinct types
of articulation to form a much larger number of phonemes. The
efficiency of combining distinctive characteristics into phonemes is
clear if we consider an imaginary language with four consonant phonemes,
each of which has only one characteristic feature: (1) dental, (2) nasal,
(3) voiceless, (4) spirant.


Although each phoneme has only one distinctive characteristic,
each phone (speech-sound) must have many non-distinctive
characteristics because the various articulatory organs must be in some position,
and any position affects the quality of the resulting sound. Moreover, in
the phonemic system under discussion, the characteristic which marks
one phoneme must not occur with the allophones of any other phoneme
as a non-distinctive characteristic. This is because, if a sound is uttered
which has the distinctive characteristics of two phonemes, the listener
would be unable to decide which phoneme the sound belongs to.

The speakers of this language, then, must be capable not only of
the articulatory adjustments which produce a sound containing the
distinctive characteristic, but also of the articulatory movements which will
produce a sound lacking the distinctive characteristic. The speaker
must be able to place his tongue in position for producing not only a
dental sound, but also one with some other place of articulation. He
must be able to produce not only a nasal sound, but one which is not
nasal; not only a voiceless sound, but one which is not voiceless; not only
a spirant, but also a sound with some other manner of articulation. It
should be noted that most of these non-distinctive articulatory positions
permit much more variation than the distinctive ones for this case. A
dental sound must be made at the teeth, but a non-dental one may be
made anywhere else in the mouth cavity. The non-spirant sound also
has a large range of permissible variations. Even the non-voiceless
sound permits some freedom, since in addition to normal voicing, the
vocal flaps can also be placed in the position for laryngealization and
trillization. The only articulatory position with no freedom is the
non-nasal; the only way to produce a non-nasal sound is to have the
velum completely closed.

The above described system is extremely inefficient. The
speakers of this language are forced to discriminate between the presence
and absence of four characteristics, but they have only four consonants.
If they combined the distinctive characteristics, they could have many
more consonant phonemes without having to learn to produce or to
recognize any more distinctive articulations. In theory, they could have
sixteen phonemes, but in practice the number would be less. A voiceless
nasal spirant, whether dental or non-dental, would be difficult for the
listeners to identify, and the speaker would frequently be asked to repeat
his words. Since people do not like to make extra effort, we would not
expect any language to have such a phoneme. There are other possible
combinations of these characteristics, such as voiceless nasals, which
are not optimally audible. Even if our hypothetical language fails to use
any combinations of nasal with spirant or nasal with voiceless, it can still
make ten phonemes with the remaining combinations. This is far
better than the original four, and the speakers and listeners are not
required to make any new discriminations. Combining distinctive features
to form phonemes is a very important method of reducing the exertions
of those who use the language.
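
The counts of sixteen and ten can be checked by enumerating the combinations of the four characteristics. The sketch below is only a check on the arithmetic; the names are hypothetical, and each characteristic is treated as simply present or absent.

```python
from itertools import product

FEATURES = ("dental", "nasal", "voiceless", "spirant")


def all_combinations():
    """Every presence/absence combination of the four characteristics."""
    return [dict(zip(FEATURES, values))
            for values in product((False, True), repeat=len(FEATURES))]


def is_usable(combination):
    """Exclude nasal combined with spirant and nasal combined with voiceless."""
    return not (combination["nasal"]
                and (combination["spirant"] or combination["voiceless"]))


usable = [c for c in all_combinations() if is_usable(c)]
```

Of the sixteen combinations, the eight non-nasal ones all survive, and of the eight nasal ones only the voiced non-spirant pair (dental and non-dental) survives, which gives the ten phonemes mentioned in the text.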

The  term  "distinctive  features"  as  used  herein,  it  should  be  noted, 
does  not  have  exactly  the  same  meaning  as  it  does  when  it  is  used  by 
Roman  Jakobson.  The  concept  of  distinctive  features  originated  in  the 
Linguistic  Circle  of  Prague  in  the  1930's.  Both  Jakobson  and  Martinet 
were  members  of  this  circle,  but  their  Ideas  have  developed  along 
different  lines  since  then.  Jakobson  maintains  that  there  are  only  twelve 
distinctive  features  in  all  of  the  languages  of  the  word,  while  Martinet 
argues  that  although  the  total  number  of  distinctive  features  in  any 
language  is  quite  small,  the  total  number  of  distinctive  features  which 
can  be  produced  and  recognized  by  human  beings  and  which  therefore  may 
occur  in  some  language  is  much  larger  than  twelve.  Moreover,  Jakobson 
says  that  all  oppusilious  are  binary,  while  Martinet  maintains  that  some 
oppositions  are  binary  and  some  are  not.  Jakobson  believes  that  the 
same  distinctive  feature  can  mark  both  the  vowels  and  the  consonants 
of  the  same  language,  while  Martinet  rejects  this  analysis  because  it 
would  require  too  much  precision  on  the  part  of  the  speaker.  If  vowels 
and consonants were marked by the same distinctive features, the
speakers  of  the  language  would  have  to  take  care  that  the  feature  was  not 
accidentally  extended  to  an  adjacent  phone.  If  vowels  and  consonants 
are marked by different features, then the extension of a consonant-marking
feature to an adjacent vowel does not interfere with communication.

Combining distinctive features to form phonemes saves exertions
because  it  requires  a  smaller  number  of  distinctive  articulations,  but 
it  frequently  requires  greater  precision  for  the  articulation  of  some 
sounds  than  it  does  for  others.  Let  us  consider  a  language  which  has  four 
different  jaw  positions  that  combine  with  distinctive  tongue  articulations 
to mark the vowels. If this language has four front vowels and four back
vowels, the speaker will have to be more careful in articulating the back
vowels than the front ones because although the jaw positions are the
same, the tongue positions are closer together for the different back
vowels than they are for the different front vowels. There is not as much
vertical room in the back of the mouth as there is in the front, and
therefore it takes much more precision to make three distinct levels in
the back of the mouth. The speakers of the language may feel that the
precision required involves too much effort, and they may make some
change  which  reduces  the  number  of  back  vowels  to  three. 


The  speakers  of  such  a  language  may  also  try  to  simplify  the 
tongue  position  for  the  back  vowels.  In  so  doing  they  may  actually 
complicate  the  phonemes  of  the  language. 

The fact that non-integrated phonemes have more room for
variation than integrated ones sometimes leads to the non-integrated
becoming integrated. As the non-integrated phonemes vary, sooner or
later some realizations of the phoneme may have a phonetic shape which
makes them part of the already existing pattern. This happens when the
language does not already have phonemes utilizing all practical
combinations of distinctive features.

Let  us  consider  the  actual  mechanism  of  sound  change.  We  must 
always  bear  in  mind  that  the  principal  difficulty  in  speech  articulation  is 
to  produce  just  those  sounds  which  are  called  for  in  a  given  context.  A 
babbling  baby  can  produce  almost  all  the  sounds  of  any  human  language, 
as  well  as  some  sounds  which  do  not  occur  in  any  language,  but  he  cannot 
control the production of those sounds. Great exertion is always easier
than precision, and perfect control over all the articulatory movements is
impossible. If two speech sounds are acoustically identical, this is an
accident,  and  it  is  an  accident  which  very  rarely  happens. 

In  actual  practice  two  phonetic  realizations  of  the  same  phoneme 
may be quite dissimilar; the one point they have in common is that both
lie within the normal range of the phoneme. The speakers aim for the
"center of gravity," but they frequently go wide of the mark. If they
go wide of the mark and fall too close to the "center of gravity" of another
phoneme, the speaker has to stop and correct himself; but if they go
equally wide of the mark in some direction where there is no other
phoneme, this does not interfere with communication and may not even be
noticed. In order to reduce the amount of precision required, languages
leave a margin of safety, a "no man's land" between phonemes.
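
The notion of a "center of gravity" and a margin of safety can be illustrated with a minimal sketch. It assumes, purely for illustration, that each phoneme can be placed at a single point on an arbitrary one-dimensional phonetic scale; the names Phoneme, heard_as and needs_correction and all numeric values are hypothetical and do not come from the model described in this report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phoneme:
    name: str
    center: float  # hypothetical "center of gravity" on an arbitrary 1-D scale


def heard_as(value: float, inventory: list[Phoneme]) -> str:
    """Return the name of the phoneme whose center lies nearest to the realization."""
    return min(inventory, key=lambda p: abs(value - p.center)).name


def needs_correction(value: float, target: str, inventory: list[Phoneme]) -> bool:
    """True when a realization aimed at `target` falls nearer to another phoneme."""
    return heard_as(value, inventory) != target


# Assumed toy inventory: three phonemes spaced three units apart.
INVENTORY = [Phoneme("B", 2.0), Phoneme("A", 5.0), Phoneme("C", 8.0)]

# A realization of A that strays 1.6 units toward B is nearer to B's center.
print(needs_correction(3.4, "A", INVENTORY))
# A realization of C that strays 1.6 units away from A meets no other phoneme.
print(needs_correction(9.6, "C", INVENTORY))
```

Under these assumptions the same amount of deviation is harmful in one direction and harmless in the other, which is the asymmetry the paragraph above describes.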

Martinet believes that many sound changes can be explained by the
tendency of a language to maintain or increase its margin of safety. If we
have three phonemes which are separated by equal margins of safety, and
one begins to change and approach another, it may set off a chain
reaction. In this particular case, the tendency to maintain the margin
of safety is not as powerful as the force causing the change. The
hypothetical case looks like this:

B        A --> C

In this situation C must either change by moving further away from A
or there will be a phonemic merger and some words which were formerly
distinct will become homonyms. If there is another place which is easily
available to C or if the merger of A and C would break down the phonemic
distinctions between a very large number of words which have the same
general distribution (i.e., if the distinction of A and C has a high
functional yield), C will change. If the distinction between A and C
has  a  low  functional  yield,  and  if  there  is  no  place  for  C  to  go,  there  will 
be  a  merger. 
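
The two conditions stated in this paragraph amount to a simple decision rule, sketched below. The function name outcome_for_c and its boolean parameters are hypothetical; the sketch encodes only the rule as worded here and not the later qualification that a low functional yield does not by itself guarantee a merger.

```python
def outcome_for_c(high_functional_yield: bool, free_place_available: bool) -> str:
    """Predict what happens to C when A drifts toward it.

    C shifts if it has somewhere to go or if the A/C distinction carries
    a high functional yield; otherwise A and C merge.
    """
    if free_place_available or high_functional_yield:
        return "shift"
    return "merger"


# Enumerate the four possible situations.
for high_yield in (True, False):
    for free_place in (True, False):
        print(high_yield, free_place, outcome_for_c(high_yield, free_place))
```

Only one of the four combinations (low yield and no free place) leads to a merger.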

As  A  moves  away  from  B,  this  widens  B's  margin  of  safety  with 
A.  If  B's  margins  of  safety  are  narrow  in  all  other  directions,  B  may 
move  towards  the  spot  formerly  occupied  by  A  to  widen  its  margins  of 
safety  with  its  neighbors.  If  these  neighbors  are  also  crowded,  they  may 
take advantage of the extra space, and there is a general shift which
affects  a  large  part  of  the  system. 

In  the  above  discussion  we  have  spoken  as  if  A  and  B  had  an 
existence  independent  of  the  speakers  of  the  language.  This,  of  course, 
is  not  true.  When  we  say  that  C  moves  in  order  to  avoid  a  merger  with 
A,  we  mean  that  the  speakers  of  the  language  favor  those  variants  of  C 
which  are  a  safe  distance  removed  from  the  "center  of  gravity"  of  A. 

This  results  in  a  shift  of  the  "center  of  gravity"  of  C. 

If  the  merger  of  A  and  C  would  result  in  a  few  cases  of  intolerable 
homonymy,  the  existence  of  these  few  cases  does  not  prevent  merger, 
but  one  member  of  the  homonymous  pair  drops  out  of  the  language,  and 
it is replaced by another word of similar meaning. A sound-change
which took place in southwestern France resulted in the words for 'cat'
and 'rooster' being identical. In a farming community, it is necessary
to have a distinct name for each domestic animal, so the old word for
'rooster' dropped out of the language, and was replaced by a word which
had formerly meant 'pheasant'. When the speakers of a language are
faced by situations of this type, they usually find a way out.

The  fact  that  a  contrast  between  two  given  phonemes  has  a  high 
functional  yield  (many  minimal  pairs)  means  that  a  merger  is  extremely 
unlikely;  but  two  phonemes  will  not  necessarily  merge  when  their  opposition 
has a low functional yield (few minimal pairs). There are very few
minimal pairs (pairs of words which are identical except that one member
of the pair has phoneme A where the other has phoneme B) for the English
phonemes /ð/ and /þ/ (or /θ/). The most commonly cited pair is thy
and thigh, but it is extremely difficult to think of a context in which
either of these is equally probable. Yet Martinet does not believe that
these phonemes will merge, because the voiced feature is supplemented
by accompanying variations in strength of articulation. In this manner
ð is distinguished from þ, v from f, z from s, ž from š, ǰ from
č, and b is separated from p, d from t, and g from k. The voiced-voiceless
opposition greatly helps to stabilize the consonant pattern of
English. He cautions, however, that the possibility of phonetic change
is  not  precluded  by  such  an  opposition,  but  concludes  that  a  merger  of 
sounds  is  less  likely  than  if  only  one  pair  of  consonants  in  the  language 
were  opposed  (Martinet,  1952). 
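
Functional yield, as defined above, can be estimated by counting minimal pairs in a lexicon. The following is a minimal sketch under the assumption that each word is available as a tuple of phoneme symbols; the toy lexicon, the transcriptions and the function names are illustrative only and are not data from this study.

```python
from itertools import combinations


def is_minimal_pair(w1: tuple, w2: tuple, a: str, b: str) -> bool:
    """True if w1 and w2 differ in exactly one position, with a on one side and b on the other."""
    if len(w1) != len(w2):
        return False
    diffs = [(x, y) for x, y in zip(w1, w2) if x != y]
    return len(diffs) == 1 and set(diffs[0]) == {a, b}


def functional_yield(lexicon: list, a: str, b: str) -> int:
    """Count the minimal pairs in `lexicon` that are distinguished only by a versus b."""
    return sum(1 for w1, w2 in combinations(lexicon, 2) if is_minimal_pair(w1, w2, a, b))


# Assumed toy lexicon; each word is a tuple of phoneme symbols.
LEXICON = [
    ("ð", "aɪ"), ("θ", "aɪ"),            # thy, thigh
    ("p", "æ", "t"), ("b", "æ", "t"),    # pat, bat
    ("p", "ɪ", "t"), ("b", "ɪ", "t"),    # pit, bit
    ("k", "æ", "p"), ("k", "æ", "b"),    # cap, cab
]

print(functional_yield(LEXICON, "ð", "θ"))
print(functional_yield(LEXICON, "p", "b"))
```

In this toy lexicon the ð/θ opposition is carried by a single pair while p/b is carried by three. A count of this kind says nothing about how probable the two words are in the same context, which is the further point made about thy and thigh.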

POSSIBLE APPLICATIONS OF MARTINET'S WORK TO OUR MODEL

One aspect of the potential relevance of Martinet's theories to
English  may  be  considered  in  the  light  of  the  chart  shown  in  Figure  17, 
which  shows  the  stops,  nasals,  and  spirants  of  English.  The  chart 
includes  all  the  dimensions  which  were  shown  in  the  charts  of  our  model, 
but  we  have  arranged  them  differently  in  order  to  graph  them  on  a 
single  piece  of  paper. 

According to Martinet, /f/ and /v/ can be considered to have the
same phonemic place of articulation; he feels that a bilabial spirant
would be very weak and difficult to recognize and therefore the speakers
of the language have substituted a labiodental spirant. The distinctive
feature, then, is labial articulation, which must involve the lower lip,
while the other articulator is either the upper lip or the teeth. Martinet
does not discuss whether the dental spirants can be considered to have
the same place of articulation as the alveolar stops. The same argument
which was used for /f/ and /v/ may apply here also. An alveolar spirant
is not easy to identify, so it is quite reasonable that the speakers of the
language should substitute a dental spirant. The distinctive feature would
be apical (tongue-tip) articulation against either the teeth or the alveolar
ridge. We assume, then, that all four spirants are well-integrated into
the phonemic pattern. It is interesting to note, however, that there are
no spirants corresponding to the guttural stops. This means that the
speakers do not have to be as careful to make a complete stop closure
for /k/ and /g/ as they must be for the labial and apical stops. It is
possible that sometimes they substitute a spirant for a guttural stop.

In his description of the phonetics of American English, C. K. Thomas
(Thomas, [date illegible]) gives one example of a word which formerly always had
a [k] and now sometimes has a spirant (a spirant with approximately
the same place of articulation as the [k] of key). The word technical
is pronounced by some Americans with a spirant instead of a [k] before
the [n].

Martinet's theories may also be relevant in considering the
problem of /l/. There is only one lateral phoneme in English; this
means that the sound pattern does not require that /l/ have a certain
place of articulation or that it be voiced rather than voiceless. When
we consider the structure of the articulators and the mouth cavity, it
becomes obvious that there are physiological constraints on the production
of  a  lateral.  It  is  impossible  to  make  a  lateral  with  the  lips  or  the  teeth, 
and  this  immediately  excludes  bilabial  and  labiodental  articulation.  As 
far as voiceless laterals are concerned, we should bear in mind that a
voiceless lateral is relatively difficult for the listeners to recognize,
although a few languages, such as Welsh, are reported to have voiceless
lateral phonemes. We may expect some lateral articulations to
be voiceless or partly voiceless, but not very many. The environments
in which voiceless laterals are most likely to occur are after [p] and
after [k], in such words as play and clay, and the voiceless lateral in
these environments is apparently the result of a coarticulation. The
tongue is in position for the lateral before the stop is released. Usually
there is a short period of voiceless lateral followed by a short period
of  voiced  lateral. 

The  place  of  articulation  of  the  laterals  is  a  far  more  complex 
problem  because  there  are  more  variations  involved.  In  an  earlier  report, 
we mentioned the fact that /l/ could show a great deal of phonetic variation,
but  at  that  time  we  believed  that  in  a  particular  environment  there  would 
be  little  variation.  We  knew  that  the  lateral  most  commonly  occurring 
in  bulk  was  quite  different  from  the  lateral  most  commonly  occurring  in 
lane  ,  but  we  assumed  that  the  lateral  of  bulk  did  not  vary  much  from  one 
pronunciation  of  the  word  to  another.  If  Martinet's  theories  are  correct, 
it  may  become  necessary  to  question  this  assumption.  It  seems  possible 
that the /l/ of bulk shows phonetic variation. The /l/ of this word
probably has a guttural place of articulation more frequently than any
other place, because it lies between a vowel and a /k/, both of which
must be articulated precisely, and the lateral which requires the least
exertion  in  this  situation  is  a  guttural  one;  but  if  the  speaker  feels  that 
the  guttural  lateral  is  not  distinct  enough,  he  may  shift  the  place  of 
articulation  forward  in  his  mouth,  either  to  the  palatal  or  to  the  alveolar 
position.  Our  model  must  be  specifically  instructed  to  expect  these 
variations. 

In  general  we  would  expect  a  guttural  lateral  before  a  guttural  and 
an  alveolar  lateral  before  an  alveolar.  We  expect  little  variation  in  the 
place of articulation of a lateral before an alveolar, because alveolar
articulation  is  the  clearest,  and  in  this  environment  it  is  also  the  easiest. 

In  our  acoustic  research,  we  will  investigate  the  problem  of  which  lateral 
is  most  common  before  a  labial. 

There is probably little phonetic variation in an initial /l/ followed
by a specific vowel, and since we treat initial consonants and following
vowels as a unit, the phonetic difference between the /l/ of lay and the
/l/ of low should pose no problems for the machine.


The  final  /l/'s  of  peel  and  pool  have  somewhat  more  freedom  to 
vary,  but  since  this  variation  will  always  be  caused  by  the  conflicts  of 
simplicity  and  clarity  the  variation  will  be  from  the  clearest  articulation, 
which  is  alveolar,  back  towards  the  easiest,  which  will  probably  be 
palatal. In our acoustic research, we will seek to establish the easiest
place of articulation for /l/ after different vowels.

Elsewhere  in  this  report,  we  have  given  a  list  of  sound-changes 
involving  1.  Judging  by  the  nature  of  the  changes,  it  seems  clear  that 
these l's are all dental or alveolar except where they are specifically
described as having some other place of articulation. These rules will
apply primarily to initial /l/'s, to final /l/'s after front vowels, and to
/l/'s before alveolar consonants, since these are always alveolar.

We  have  discussed  the  laterals  in  detail  because  of  all  the  sounds 
which  we  have  included  in  our  charts,  these  are  the  least  "integrated." 

The importance of the distinctive features is that when they are
combined, the resulting sounds form a pattern. If a language combines
the distinctive features of three places of articulation (labial, dental,
and guttural), two vocal flap positions (voiced and voiceless) with a stop
articulation, there are six possible phonemes, as follows:


              Labial    Dental    Guttural

Voiceless       p         t          k

Voiced          b         d          g

Each of these phonemes is integrated into the pattern; it is the product
of combining a number of different distinctive features.

The above situation contrasts with the position of a phoneme such
as English /l/. The distinctive feature of /l/ is lateral articulation, and
no other phoneme has this feature. None of the stop consonants listed above
has any feature which is unique. This characteristic of /l/ makes
it completely non-integrated, while the stops are well integrated. English
/l/ is free to vary phonetically, since neither the place of articulation
nor the vocal flap adjustment are phonemic for it.
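
The contrast between integrated and non-integrated phonemes can be made concrete with a short sketch. It builds the six stops as combinations of place and vocal flap features and then tests whether every feature of a phoneme is shared with some other phoneme. This test is our own simplified reading of "integrated" for purposes of illustration, and the names build_inventory and is_integrated are hypothetical.

```python
from itertools import product

PLACES = ("labial", "dental", "guttural")
VOICING = ("voiceless", "voiced")
STOP_SYMBOLS = {
    ("voiceless", "labial"): "p", ("voiceless", "dental"): "t", ("voiceless", "guttural"): "k",
    ("voiced", "labial"): "b", ("voiced", "dental"): "d", ("voiced", "guttural"): "g",
}


def build_inventory() -> dict:
    """Map each phoneme symbol to its set of distinctive features."""
    inventory = {
        STOP_SYMBOLS[(voicing, place)]: frozenset({"stop", place, voicing})
        for voicing, place in product(VOICING, PLACES)
    }
    inventory["l"] = frozenset({"lateral"})  # the lone lateral, with no other phonemic feature
    return inventory


def is_integrated(symbol: str, inventory: dict) -> bool:
    """True if every feature of `symbol` also marks at least one other phoneme."""
    others = [feats for name, feats in inventory.items() if name != symbol]
    return all(any(f in feats for feats in others) for f in inventory[symbol])


INVENTORY = build_inventory()
for symbol in INVENTORY:
    print(symbol, is_integrated(symbol, INVENTORY))
```

Under this reading each of the six stops is integrated, because none of its features is unique, while /l/ is not, because lateral articulation marks no other phoneme.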

Martinet believes that a well-integrated phoneme is far less subject
to individual change than a non-integrated one. In the pattern given
above, the dental /t/ is not likely to become alveolar while everything
else remains the same. If this happened, the number of places of
articulation would be increased without increasing the number of phonemes.
There would be an increase in exertion which would not be compensated
for elsewhere, either by a reduction in exertion or by an increase in
clarity. It is quite possible, however, that /t/ and /d/ might both change
their place of articulation and become alveolar stops. This would involve
no change in the total number of distinctive features.




One particular limitation to Martinet's work is inherent in the
present development of phonemic theory itself. Researchers in the field
of phonemics assume a direct relationship between distinctive features
on one hand and resonance, place of articulation and manner of articulation
on the other. Phonemic research from its early inception had defined
phonemes in terms of resonance, place and manner of articulation.
Researchers such as Martinet, Jakobson, and Halle describe the phonemes
of a language as having distinctive features that distinguish them from all
other phonemes in a language. The distinctive features of a given phoneme,
however, do not distinguish between possible allophones. Hence in
Martinet's theories the distinctive features for k as in cook are identical
with those of k as in kit, and by definition identical with the distinctive
features of any other allophone of k. This poses a problem in transcribing
acoustic data, since the phoneme k is described to include at least
four different (but not distinctive) places of articulation in English speech,
and the k of cook has a different place of articulation from the k of kit.

Such  a  situation  indicates  very  strongly  that  distinctive  features  do  not 
characterize  with  necessary  precision  the  resonances  or  place  and  manner 
of articulation of speech segments. This lack of precision makes it
extremely difficult to interpret meaningfully the relationship between
distinctive features of a "phoneme" and the "center of gravity" of its place
of articulation.

An added problem in the use of distinctive features in descriptions
of speech segments is the lack of precise relationship between these
features and the acoustic characteristics of waveforms of speech. Such
waveforms, essential to the data included in our model, depend on the
precise  definition  of  resonance,  place  and  manner  of  articulation;  as 
indicated  above,  distinctive  features  lack  such  precise  definitions. 

Another important omission is that Martinet devotes little attention
to  the  problem  of  the  coloration  which  vowels  give  to  preceding  and 
following  consonants.  Our  model  includes  a  special  subdivision  for 
this  subject  particularly  because  articulation  of  a  consonant  is  likely  to 
be  strongly  influenced  by  the  preparation  which  a  speaker  must  make 
for  his  following  vowels.  The  mechanics  of  sound  change  or  variation 
in rapid speech are also likely to be influenced by what particular
combinations of vowels and consonants occur together. In the case of
sit  down  there  is  a  definite  transformation  of  the  unvoiced  dental  stop, 
but  in  the  case  of  took  good  there  is  less  likely  to  be  a  transformation 
because  the  cluster  occurs  between  identical  vowel  sounds.  Such 
problems  deserve  further  attention. 




An additional problem is that Martinet's rules of sound change
are  at  best  only  preferential;  they  have  occasionally  been  observed  to 
conflict  with  changes  in  the  languages  which  are  assumed  to  be  relevant 
to  English  under  the  premises  of  the  Ergodic  theory  and  which  in  some 
cases  have  an  observed  relationship  to  sound  changes  actually  taking 
place in rapid speech. Appendix H. I cites several examples of sound
changes relevant to English that cross the boundaries of distinctive
features. Among them are the gradual shift of a guttural stop plus
f to ff (as in big fire), the shift of a dental stop plus a guttural stop
to a guttural stop (at camp), and the transformation of a dental stop
before a labial stop to a labial stop (at bay). In all similar cases
physical convenience in pronouncing the words may transcend the
importance of distinctive features.

The inability of Martinet's theories to account for the sound
changes discussed above argues that distinctive features at best have only
limited relevance to our investigations. Dimensions of our model are
constructed so as to account for many of the problems already discussed.
By treating consonant-vowels as one unit, we recognize the co-articulation
of phones and avoid the question of different places of articulation for the
same consonant caused by vowel coloration. By measuring duration we
are able to analyze the process which may result in vowel coloration,
changes in transition resulting in the articulation of a semi-vowel,
difference in the target value of formants, and change in the
characteristics of a sound through increased or lessened emphasis. By devising
our new system of intensity measurement, already discussed, we are
able to analyze in greater detail the influence of preceding and following
sound units upon acoustic definition of a given phoneme while providing
further information on the effects of emphasis.

Additional important information in our model is generated by
specific investigation into the rules of euphonic combination. When a
group of the allophones of any given phoneme combine in speech their
characteristics may at times be modified to so great an extent that
their acoustic signals may be identified as belonging to another distinct
phoneme within the language; bet you becomes bechyou; at other times
phonetic combinations result in total elimination of the signal for some
phone classes, e.g. hot day becomes hoday. All such acoustic shifts
form an essential and integrated basis for the definition of speech.

Any description of speech which does not consider such aspects
may not be adequate to describe the acoustic characteristic of speech
signals or to define a speech segmenting system suitable for actuating
electronic machinery that can recognize speech by such characteristics.




Despite  present  limitations  in  data,  however,  the  scope  of  our  model  suggests 
that  distinctive  features  do  not  provide  sufficiently  full  analysis  of  sound 
change  for  exclusive  incorporation  into  our  data. 




APPENDIX  H 


RULES  OF  SOUND  CHANGE  AND  EUPHONIC  COMBINATION 

H.  I 

CHANGES  IN  PLACE  OF  ARTICULATION 

1.  In  guttural  stops  the  following  articulatory  changes  can  occur: 

(a) A voiceless guttural stop + y becomes ss (Ionic Greek) or tt
(Attic Greek); the second change may happen in cute [kyut], but
this is not likely.

(b) A voiced guttural stop + y becomes zd (Greek); this does not
happen in English.

(c) A guttural stop + f becomes first a labial stop + f, then becomes
ff (Latin). This may happen to such combinations as big fire in
rapid speech.

(d) The sounds k + t become tt (Vulgar Latin); this may happen in
active but evidence seems to contradict it.

2. In dental stops the following changes occur (English has no dental
stops, but these rules may also refer to alveolar stops):

(a) A dental stop before a labial stop becomes a labial stop (Latin);
this may happen in at bay.

(b) A dental stop before a guttural stop becomes a guttural stop
(Latin), possibly true of at camp.

(c) A voiceless dental stop + y becomes s or [illegible] (Greek); this does
not happen in English.

(d) A voiced dental stop + y becomes zd (Greek); this also does not
happen in English.




(e) A voiceless dental stop + y becomes č in English; at you becomes
a choo.

(f) A voiced dental stop + y becomes ǰ; did you becomes di joo.

(g) A dental stop + m becomes mm (Latin). This may happen in English
in the word atmosphere, but there is an important difference
between the English example and the Latin. In English the
difference between single and double consonant is not phonemic
(not word-differentiating); single and double are probably in free
variation in those positions where other languages have only
double consonants.

(h) A dental stop + f becomes a labial stop + f, then becomes ff (Latin);
in English this change would not necessarily produce two f's for
the same reason that atmosphere does not necessarily have two
m's; at five is a possible example.

(i) A voiceless dental stop preceded by ṣ becomes ṣ + a voiceless
lingual stop (Sanskrit); English has no lingual stops.

(j) A lingual stop + a dental stop becomes lingual stop + lingual
stop (Sanskrit).

3.  In  dental  nasals  the  following  changes  may  happen: 

(a) An n + guttural stop becomes ŋ (ng) + guttural stop (Greek);
in income, nk becomes ŋk.

(b) An n + a labial stop becomes m + a labial stop (Greek); in bed
becomes im bed.

(c) The sounds n + j become ñj (Sanskrit), injure.




(d) A palatal stop + n becomes palatal stop + ñ (Sanskrit); possibly
this occurs in such phrases as change now.

(e) An alveolar nasal n + m becomes m or mm in English; ten minutes becomes
te minutes.

4. In dental sibilants the following changes can take place:

(a) An intervocalic sr becomes þr, then becomes br (Latin); this
probably doesn't occur in English.

(b) The sounds z + m become mm (Germanic); possibly this happens
in is mine.

(c) A palatal + s becomes s (Sanskrit), possibly in church steeple.

5. In labial stops there are the following possible changes:

(a) A labial stop + a guttural stop becomes a double guttural stop
(Latin); this could happen in up country, though the stop might
not necessarily be double.

(b) The sounds p + t become tt (Vulgar Latin); this might happen in
up to, although again the sound may not be geminate.

6. The following changes take place in labial nasals:

(a) A final m becomes n (Greek, Old English); possible in English,
but not probable.

(b) The sounds m + d become nd (Germanic), possible in am down.

(c) The sounds m + s become s (Latin), possible in am so.

(d)  A  final  m  and  non-labial  stop  become  homorganic  nasal  +  stop 
(Sanskrit),  possible  in  am  going. 


145 


7. The sound l becomes guttural or palatal in Latin and palatal in
English:

(a) when followed by a back vowel, low;

(b) when followed by a consonant (except l), built;

(c) when it occurs at the end of a word, ball.
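
Several of the rules above, namely 2(a), 2(b), 3(a) and 3(b), can be applied mechanically to a string of phones, as the following minimal sketch suggests. It rests on assumptions of our own: English alveolar stops are treated as the "dental" stops of the rules, a stop keeps its voicing when its place changes, and the phone symbols and the function name assimilate are hypothetical.

```python
LABIAL_STOPS = {"p", "b"}
GUTTURAL_STOPS = {"k", "g"}

# Assumed place shifts that preserve voicing: (stop, new place) -> new stop.
PLACE_SHIFT = {
    ("t", "labial"): "p", ("d", "labial"): "b",
    ("t", "guttural"): "k", ("d", "guttural"): "g",
}
# The dental nasal n takes the place of a following stop (rules 3a and 3b).
NASAL_SHIFT = {"labial": "m", "guttural": "ŋ"}


def place_of(phone):
    """Return 'labial' or 'guttural' for the stops that trigger assimilation, else None."""
    if phone in LABIAL_STOPS:
        return "labial"
    if phone in GUTTURAL_STOPS:
        return "guttural"
    return None


def assimilate(phones):
    """Apply rules 2(a), 2(b), 3(a) and 3(b) to each pair of adjacent phones."""
    out = list(phones)
    for i in range(len(out) - 1):
        place = place_of(out[i + 1])
        if place is None:
            continue
        if out[i] == "n":
            out[i] = NASAL_SHIFT[place]
        elif (out[i], place) in PLACE_SHIFT:
            out[i] = PLACE_SHIFT[(out[i], place)]
    return out


print(assimilate(["æ", "t", "b", "eɪ"]))       # at bay
print(assimilate(["ɪ", "n", "b", "ɛ", "d"]))   # in bed
print(assimilate(["æ", "t", "k", "æ", "m", "p"]))  # at camp
```

Whether such changes actually occur in rapid English speech is, as the rules themselves state, only a possibility; the sketch shows how candidate variants of a phrase could be generated, not which variants speakers in fact produce.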




APPENDIX H. II


CHANGES  IN  MANNER  OF  ARTICULATION 

A. Changes Involving Stops

1. Changes from stop to spirant:

(a) In Germanic p becomes f everywhere except after a spirant or
sibilant. In English, speakers may sometimes fail to make a complete
stop closure, but we would expect the bilabial spirant
rather than a labiodental spirant.

(b) In Germanic t becomes þ everywhere except after s or spirant.
Again, in English a speaker might fail to make a complete closure,
but we would expect an alveolar spirant rather than a labiodental
one.

(c) In Germanic k becomes a voiceless guttural spirant everywhere
except after a stop or spirant. In American English some speakers
substitute a spirant for a stop in the word technical.

(d) In Modern Greek, all voiced unaspirated stops become spirants.
We expect that this happens sometimes in Modern English; be
may sometimes be pronounced with a [β] instead of a b.

2. Changes from stop to sibilant:

(a) In French k before front vowels becomes s. In Lithuanian
Proto-Indo-European k becomes š. If this happens in English, it
affects words like kick, which may become sick or shick, but we
do not expect this to happen.

(b) In Latin and Germanic, Proto-Indo-European tt becomes ss. If
this happens in English, phrases like at two may become as sue,
but we do not think that this occurs.

B. Changes Involving Spirants

1. Change from spirant to stop:

In Swedish [þ] becomes t. This happens sometimes in English.
It affects words like thing.

2. Change from spirant to h:

In Spanish f becomes h. If this happens in English, it affects words
like fine.

C. Changes Involving Hissing Sibilants

1. Non-combinatory changes:

(a) s becomes h in ancient Greek initially and between vowels. In
English see may become he if it is weakly articulated.

(b) s becomes r between vowels in Latin and at the end of a word in
Sanskrit if the next word begins with any voiced sound except r.
(This change probably takes place in two steps. First s becomes z,
then z becomes r.) We would not expect r as a variant of s in
English, although s may become z, as indicated in Appendix H. III
(p. 153, rules 1 and 2).

(c) z becomes r in very early Old English and Old Norse. We are not
sure this happens in English, even as a slip of the tongue. If it
should happen, easy would be almost homonymous with eerie.


148 


2. Combinatory changes:

(a) A dental stop plus s becomes ss in Greek and Latin. This
probably occurs in English in phrases like at sea.

(b) sk becomes [ʃ] (the consonant of she) in Old English and Old High
German. This may occur in English. If it does, it would make
skip identical to ship.

(c) In Latin sr between vowels becomes [θr] (like the initial cluster
of three). In standard American English, the cluster sr does not
occur within a word, but [θ] is probably substituted for s
sometimes in phrases like less rain.

(d) In Germanic zm becomes mm. This may happen in phrases like
is more.

(e) In Latin sf becomes ff. This may happen in phrases like bus fare.

(f) In Modern English sy frequently becomes [ʃ] and zy becomes [ʒ].
This occurs in phrases like miss you and as you.


D. Changes Involving Laterals
1. Non-combinatory changes:

(a) Changes from stop to lateral articulation: In Latin d becomes l.
In Sanskrit retroflexed d becomes retroflexed l. This may
occur as a slip of the tongue in English: dot may become lot.

(b) Change from lateral to stop: In some French dialects ll becomes
t in final position. There are two changes involved here: a
change in manner of articulation and a change in resonances.
The change in manner of articulation probably took place first.
We do not expect this change to take place in final position in
English, since a final l usually does not have the same place of
articulation as a t.

(c) Alternation between stop and lateral: In Greek d alternates with
l. This means that either d becomes l or l becomes d. We
have already suggested that dot may be mispronounced as lot; it
is also possible that lot is mispronounced as dot.

(d) Change from lateral to sibilant: In Castilian Spanish  becomes
jj. We do not expect this to happen in English.

(e) Changes from lateral to r and from r to lateral:

(1) In some Sanskrit dialects r becomes l. As a slip of the
tongue, right may be pronounced as light.

(2) In some Sanskrit dialects l becomes r. As a slip of the
tongue, light becomes right. Historically, this is what
happened to the word colonel, which was once pronounced
with an [l] between the o's. The [l] became [r] but the old
spelling remained.


2. Combinatory changes:

(a) Loss of l:

(1) l drops out in an unaccented syllable before t in Old Icelandic.
This may occur in English in words like belt.

(2) l drops out before p in some dialects of American English.
In these dialects help is pronounced [hɛp].


(b) Loss of another consonant in contact with l:

(1) In Greek ls becomes l and the preceding vowel is lengthened.
If this occurs in English it may affect words like else, but as
far as we know it does not happen.

(2) In Greek lw becomes l and the preceding vowel is lengthened.
In English this may happen in such phrases as ill wind, but it
is also possible that the w remains while the l drops.

(3) In Latin and in Old Icelandic initial ^ becomes l. This
cluster does not occur initially in English.

(4) In Old Icelandic nl becomes l and the preceding vowel is
lengthened and nasalized. This probably occurs sometimes
in words like inlay.

(5) In Modern English ly becomes a palatal l, as in the word
million. This palatal l sometimes becomes y so that
million is pronounced [mɪyən]. This sound change commonly
occurs in phrases like will you.

(c) Addition of a consonant to a cluster involving l:

(1) In Latin ml becomes mpl. This may occur in such English
phrases as am late, although the rule which follows probably
applies more frequently.

(2) In Greek ml becomes mbl. This probably occurs sometimes
in the phrase am late.

(d) Complete assimilation of consonants in contact with l:

(1) In Greek and Latin nl becomes ll. English in late may
become il late, but English consonant durations vary freely,
so that the phrase may also be pronounced i late.

(2) In Latin and Germanic ln becomes ll. This may happen in
the English word illness.

(3) In Germanic, zl becomes ll. In English is late may become
il late or i late. This would make the phrases in late and
is late homonymous.

(4) In Latin ls becomes ll. If this happens in English, else
would become ell, but this probably doesn't occur.

(5) In Latin rl becomes ll. This may occur in phrases like here
later.

(6) In Latin dl becomes ll. This probably occurs in phrases like
bad link.

(7) In Sanskrit tl becomes ll. This probably occurs in phrases
like at last.

(8) In Germanic [þ]l becomes ll. This may occur in such phrases
as with luck.

(9) In Greek, ly becomes ll. In English ly usually becomes a
palatal l.

(e) Change in the manner of articulation of another consonant in
contact with l:

In Latin v becomes b whenever it was preceded by l. This
might occur in English phrases like full value.


APPENDIX H. III


RESONANCES

A. 1. A voiceless consonant becomes voiced between vowels:

(a) when it occurs at the end of a word and the next word begins
with a vowel (external combinations of Sanskrit), up at;

(b) when p, t, or k occur in the middle of a word (British Celtic);
the difference between latter and ladder is in the vowel, not
in the consonants;

(c) when s occurs between vowels; glassy becomes glazzy;

2. A voiceless consonant becomes voiced next to a voiced consonant:

(a) when a voiceless stop or s occurs before a voiced stop or
spirant (Latin), up the becomes ub the;

(b) when p or k occur before d (Greek), back door becomes
bag door;

(c) when an s occurs between a vowel and a voiced consonant
(Latin), glass door becomes glaz door;

(d) when a final voiceless consonant precedes a word beginning
with a voiced consonant (Sanskrit), Back Bay becomes Bag Bay.

3. A voiceless spirant or sibilant can become voiced when it occurs
between voiced phones, except when the preceding vowel is accented
(Proto-Germanic); in English we have the same situation in the
phonetic difference between exert with gz and exercise with ks.


B. 1. A voiced consonant can become voiceless:

(a) if it occurs before a voiceless stop, spirant, or sibilant
(Latin, Greek), big town may become bik town;

(b) when h occurs before w (some American dialects), where;

(c) when a voiced consonant occurs in the final position (Sanskrit:
all voiced final stops; German: all voiced final stops,
spirants and sibilants). In English, this may occur at the
end of an utterance; where is the bag may become where is
the bak. (The last word would still have a vowel like bag,
rather than a vowel like back.)
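
The voicing rules in A.2 and B.1(a) can be illustrated with a minimal sketch. It assumes that words are given as lists of simplified phone symbols and that only the small set of obstruent pairs listed in the code takes part; the symbol names, the pair table, and the function name are illustrative assumptions, not part of the model. The rules for position between vowels (A.1) and for utterance-final position (B.1(c)) are not covered.

```python
# Toy phone symbols (assumed): "th" and "dh" stand for the initial
# sounds of "thing" and "the".
TO_VOICED = {"p": "b", "t": "d", "k": "g", "s": "z", "f": "v", "th": "dh"}
TO_VOICELESS = {voiced: voiceless for voiceless, voiced in TO_VOICED.items()}


def assimilate_voicing(first, second):
    """Return `first` with its final obstruent matched in voicing to the
    initial obstruent of `second`. Words are lists of phone symbols."""
    result = list(first)
    if not first or not second:
        return result
    final, initial = first[-1], second[0]
    if final in TO_VOICED and initial in TO_VOICELESS:
        # voiceless final before a voiced obstruent: back door -> bag door
        result[-1] = TO_VOICED[final]
    elif final in TO_VOICELESS and initial in TO_VOICED:
        # voiced final before a voiceless obstruent: big town -> bik town
        result[-1] = TO_VOICELESS[final]
    return result
```

Under these assumptions, `assimilate_voicing(["b", "a", "k"], ["d", "o", "r"])` yields the phones of bag, and a final consonant before a vowel is left unchanged.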




APPENDIX H. IV


SOUND DROP OUTS

A. 1. In consonant clusters w will disappear phonetically:

(a) before ^ and £ (Old Icelandic, Pre-Latin);

(b) after l, r, and n while lengthening the preceding vowel
in some dialects (Greek), bulwark may become bulark;

(c) medially before any consonant (Old Icelandic).

2. In consonant clusters y will drop out after £ or £, while lengthening
the preceding vowel (Greek), Bunyan may become Bunan.

3. In consonant clusters l will drop out:

(a) in an unaccented syllable before t (Old Icelandic), belt may
possibly become bet if it is unaccented;

(b) before an s (Greek), Elsa may possibly become Esa, but
we know of no instances where this has actually occurred.

4. In consonant clusters s disappears:

(a) between two consonants except two stops with the same place
of articulation (Greek), pigsty might become pigty;

(b) after r or l, lengthening the preceding vowel (occasional
example in Greek), hearse might become [hɛr]. This change
probably does not occur in English;

(c) after a vowel before a voiced consonant (Latin); glass door
might become gla door, but the change probably doesn't
occur in English.


5. Although English has no dental stops we have provisionally equated
alveolar drop-outs with the dental stop drop-outs that occur in
consonant clusters of other languages. Dental stop drop-outs occur:

(a) before  after n (Germanic), plenty becomes pleny;

(b) before a  after n (Greek), landscape drops the d in some
American dialects;

(c) dental stop drop-outs also occur when an initial d precedes y
(Latin). General American does not have this consonant
cluster although some Southern dialects may, as in due.
In this case the d probably does not drop, however;

(d) in English a proven alveolar drop-out occurs when the stop
comes between s and s; thus lasts frequently becomes lass;

(e) an alveolar drop-out also occurs sometimes before l, as when
little becomes lil.
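
Rule 5(d) is simple enough to state as a one-line rewrite. The sketch below assumes that ordinary spelling stands in for the phone string, which is only adequate for examples such as lasts; the function name is hypothetical.

```python
import re


def drop_alveolar_between_s(phones: str) -> str:
    """Delete t or d when it stands between two s phones (rule 5(d))."""
    # Lookbehind and lookahead leave the surrounding s phones in place.
    return re.sub(r"(?<=s)[td](?=s)", "", phones)
```

With this sketch, `drop_alveolar_between_s("lasts")` gives `"lass"`, while `"last"` is returned unchanged because no s follows the stop.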

6. Guttural stops disappear:

(a) after s before a consonant (Germanic), asked becomes ast;

(b) after r or l, and before m or n (Latin), bulks (k possibly
dropped);

(c) when an initial g occurs before y (Latin); possibly this occurs
in such English words as argue, but probably not.

7. Dental and guttural phones disappear before s plus a consonant
(Germanic); this may occur in such English words as huckster.

8. In consonant clusters n disappears:

(a) between a vowel and s (Latin), insert;

(b) after a vowel before £ or l (Old Icelandic), inlay.

9. When it occurs in consonant clusters, m also disappears:

(a) between a vowel and s (Latin), possibly also in English
combinations such as come soon;

(b) between a vowel and f (Latin, Old Icelandic); probably this
occurs in such words as comfort.

10. The sound th disappears between consonants in English, as in
fifths.


B. In positions other than consonant clusters the following sounds can
drop out. Phonetic symbols suggest how this occurs in most cases
as an aid to the reader.

1. In final positions these sounds disappear:

(a) m (Latin, Sanskrit, Germanic), m becomes a voiceless nasal,
then drops out completely; probably this does not occur in
English;

(b) n (Germanic, Old English), becomes a voiceless nasal, then
drops out completely; this may occur with man;

(c) dental stops (Latin, Germanic); dental stop becomes a glottal
stop, then drops out completely;

(d) the alveolar stop t, which can change to a glottal stop or
drop completely under certain circumstances (Modern English):
field or that boy, though probably only before a consonant.
This loss of alveolar stops has been described by many
phoneticians, among them C. K. Thomas in his book An
Introduction to the Phonetics of American English (p. 40);

(e) s (Latin, when it occurs after  and the next word begins with
an initial consonant); s becomes h, then drops out completely,
or possibly s becomes z, then drops out completely; if this
applies in English, loose connection would drop its s, but
evidence seems to be against it.

2. Between vowels these sounds disappear:

(a) y (Old Icelandic), y becomes h, then drops out completely;
crying may be reduced to one syllable;

(b) w (Greek; Latin when between like vowels and the second is
unstressed); rowing may be reduced to one syllable;

(c) h (Greek), h drops out completely; in English is he becomes iz e.

3. The following initial sounds are likely to drop out:

(a) y (Old Icelandic), y becomes h, then drops out completely;
we are not sure this happens in English;

(b) w (Greek); w drops out completely; we are also unsure that this
happens in English;

(c) h (Modern Greek, Modern English when it is first in an unstressed
syllable), h drops out completely; here's your book may
possibly lose the h in here's.


4. The sound w will drop out particularly:

(a) before a strongly rounded vowel (Old Icelandic); wool is a
possible instance, though we do not presently believe this
happens in English;

(b) before  (Pre-Latin);

(c) before  (Germanic);

(d) after o (Old Icelandic); probably this happens in English, as
in low window.

5. Before a high front vowel y drops out (Old Icelandic): yield is an
example, but this probably does not occur in English.


APPENDIX H. V


A. CHANGES INVOLVING PHONES WITH A SINGLE
PLACE OF ARTICULATION

A. Changes Involving [y]:

Some of the rules listed here have been given elsewhere also.
All rules concerning y are included here because they will help us fit
y into the model.

1. Non-combinatory changes:

(a) Loss of y:

(1) In Old Icelandic, initial y is lost; it may become h before it
drops completely. We doubt that this happens in English; if
it does, it affects words like you.

(2) In Greek and Latin, y between vowels is lost. A similar
variation in which y and an adjacent vowel are lost occurs in
English words such as crying [kraɪyɪŋ], which becomes
[kraɪŋ].

(b) Change of y:

(1) In Greek initial y becomes h. This probably does not happen
in English. If it does, it affects words like you.

(2) In North Germanic, y between vowels becomes ggy. This
does not happen in English.

(3) In Gothic, y between vowels becomes ddy. This does not
happen in English.

(4) In Welsh, Proto-Indo-European y becomes [ð] after stressed
e or i. We do not believe that this happens in English.

(5) In English h + y sometimes becomes the voiceless guttural
spirant [ç]. This occurs in the word hue.

2. Combinatory changes:

(a) Consonant clusters in which y is changed:

(1) A voiceless guttural or dental stop + y becomes ss in Ionic
Greek and tt in Attic Greek. We do not believe that this
happens in English. If it does, it affects words like cute.

(2) A voiced guttural or dental stop + y becomes zd in Greek.
This does not happen in English.

(3) In Greek l + y becomes ll. In English this combination yields
a palatal lateral. (See Appendix H. II. D, Changes Involving
Laterals.)

(4) In Greek p + y becomes pt. This does not occur in English.

(b) Consonant clusters in which y remains:

(1) In Latin d + y becomes y in initial and medial position. In
American English, except for some Southern dialects, the
initial cluster dy- does not occur. In those dialects where
due is pronounced dyu, this change may occur.

(2) In Latin g + y becomes y in initial and medial position. This
may occur in American English. If so, it affects words like
argue.

(3) In Sanskrit m + y becomes nasalized y + y. This may
happen in English, but we doubt it; if it does happen, it
affects phrases like come yet.


B. Changes Involving [r]

1. Non-combinatory changes:

(a) Loss of r:

In some dialects of American English r after vowels drops
out. Many Southerners and New Englanders pronounce far
without an r.

(b) Insertion of r:

In those American dialects which drop final r except when the
following word starts with a vowel, an r is frequently
inserted between a word ending with a vowel and another
beginning with a vowel. A New Englander commonly
pronounces deer that and idea that as [diə ðæt] and [aidiə ðæt],
but he pronounces deer is and idea is as [diriz] and
[aidiriz].
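
The loss and insertion of r described in (a) and (b) interact at word boundaries, and a minimal sketch may make the interaction easier to follow. It assumes words are lists of simplified phone symbols and uses a small hypothetical vowel set; the function name is illustrative only.

```python
# Hypothetical vowel inventory for the sketch.
VOWELS = {"a", "e", "i", "o", "u", "ə", "æ"}


def join_r_dropping(first, second):
    """Join two non-empty words as an r-dropping speaker might.

    Final r is kept only before a vowel, and an r is inserted between
    a vowel-final word and a vowel-initial word.
    """
    first = list(first)
    next_is_vowel = second[0] in VOWELS
    if first[-1] == "r" and not next_is_vowel:
        first.pop()            # deer that: the r drops before a consonant
    elif first[-1] in VOWELS and next_is_vowel:
        first.append("r")      # idea is: an r is inserted
    return first + list(second)
```

Under these assumptions deer is keeps its r, deer that loses it, and idea is gains one, which matches the pattern reported for New England speech in (b).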

(c) Change of r:

(1) In some Sanskrit dialects r becomes l. This probably
happens in English; it affects words like right.

(2) In French r becomes z. As far as we know this does not
happen in English.

(3) In Sanskrit r becomes visarga before an initial voiceless stop
or sibilant. This probably does not happen in English.

2. Combinatory changes:

(a) Insertion of another consonant between a nasal and r:

(1) In Greek mr becomes mbr. This happens sometimes in
English in phrases like come running.

(2) In Greek nr becomes ndr. This may happen in English.
If it does, it affects phrases like in reference.

(b) Complete assimilation of another consonant to r:

(1) In Latin nr becomes rr. If this happens in English, it
affects phrases like in reference.

(2) In Latin rs becomes rr. If this happens in English it
affects phrases like for sale.
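
The insertion rules in 2(a) (mr becomes mbr, nr becomes ndr) both add the voiced stop that shares the place of articulation of the nasal. A minimal sketch, assuming phone lists with simplified symbols and a hypothetical function name:

```python
# Voiced stop sharing the place of articulation of each nasal.
EPENTHETIC_STOP = {"m": "b", "n": "d"}


def insert_stop_before_r(phones):
    """Insert b after m and d after n when r follows (rules 2(a)(1) and (2))."""
    out = []
    for i, phone in enumerate(phones):
        out.append(phone)
        following = phones[i + 1] if i + 1 < len(phones) else None
        if phone in EPENTHETIC_STOP and following == "r":
            out.append(EPENTHETIC_STOP[phone])
    return out
```

Applied to a simplified phone list for come running, the sketch places a b between the m and the r and leaves the later n untouched, since no r follows it.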


C. Changes Involving h

1. Non-combinatory changes:

Loss of h:

In Greek initial h and h between vowels drop out. This
occurs in unstressed syllables in English; "is he?" becomes
[ɪzi].

2. Combinatory changes:

Cluster in which h and the other consonant are both modified:

h + y in Modern English sometimes becomes the voiceless
guttural spirant [ç]. The word human is frequently
pronounced [çumən].


D. Changes Involving [ʔ]

1. A final t before any initial bilabial consonant ([p], [b], [m],
[w]) is frequently replaced by [ʔ]. We have observed this
in several different American dialects, including those of Virginia
and Michigan; we believe it occurs in most dialects. Thus the
phrase that one [ðæt wʌn] becomes [ðæʔ wʌn]; that boy
[ðæt bɔɪ] becomes [ðæʔ bɔɪ]; atmosphere [ætməsfɪr] becomes
[æʔməsfɪr].

2. A t before an l is normally replaced by [ʔ] in certain dialects,
including that of New York City. Thus bottle [batl] becomes
[baʔl].

3. A [ʔ] is sometimes inserted between a final vowel and an initial
vowel. The ink [ðə ɪŋk] becomes [ðə ʔɪŋk].
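
Rule 1 of this section can be sketched as a boundary rewrite. The example assumes phone lists with simplified symbols; the bilabial set and the function name are illustrative choices rather than part of the model.

```python
BILABIALS = {"p", "b", "m", "w"}


def glottalize_final_t(first, second):
    """Replace a word-final t by a glottal stop before an initial bilabial."""
    first = list(first)
    if first and second and first[-1] == "t" and second[0] in BILABIALS:
        first[-1] = "ʔ"
    return first
```

Under these assumptions the final t of that is replaced before boy or one, and is left alone before a word such as song.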




APPENDIX H. V


B. CHANGES INVOLVING PHONES WITH TWO
PLACES OF ARTICULATION

A. Changes Involving Labiovelar Stops

1. Changes involving retention of one place of articulation and loss of
the other:

(a) Retention of guttural (velar) articulation:

(1) In Sanskrit kʷ and gʷ become k and g before a consonant
and before a Proto-Indo-European back vowel. In Greek
kʷ and gʷ become k and g before or after u.

(2) In Latin kʷ and gʷ become k and g before a consonant or
before u.

(3) In Greek kʷ and gʷ become k and g before or after u.

(4) In Germanic kʷ and gʷ become k and g before a rounded
vowel. If this happens in English, it affects words like quote.

(b) Retention of labial articulation:

(1) In Oscan kʷ and gʷ become p and b in all environments.
We doubt that this happens in English. If it does, it affects
words like quit, quite, and quote.

(2)  In  Latin  becomes  ^  cvcrywheiU!  except  after  _i£  or  before 
a  consonant  or  u.  If  this  liappiiiis  in  English,  it  ei';!!,''‘j 

i-tlV';  OvVtii* 




2. Changes involving shift in place of articulation:

(a) Shift to dental articulation:

In Greek kʷ becomes t and gʷ becomes d before a front
vowel. We doubt that this happens in English. If it does,
it affects words like quick and Gwen.

(b) Shift to palatal articulation:

In Sanskrit kʷ becomes c and gʷ becomes j before Proto-
Indo-European front vowels. We doubt that this happens in
English. If it does, it affects words like quit and Guam.


B. Changes Involving w

Some of the rules listed here have been given elsewhere also. All
rules concerning w are included here because they will help us fit w into
the model.

1. Non-combinatory changes:

(a) Loss of w:

(1) In Greek initial w is lost. This may not happen in English.
If it does, it affects words like we.

(2) In Greek w between vowels is lost. This probably happens in
English; it affects words like rowing, and reduces them to
one syllable.

(3) In Old Icelandic w drops out after o and before any strongly
rounded vowel. In English this may affect phrases like low
window.


(b) Change of w:

(1) In German and Latin w becomes v. This may happen in
English; if it does, it affects words like we.

(2) In Welsh, initial w becomes gw. This also happens with
Germanic loan words in French. This may happen sometimes
in English, in words like with.

(3) In North Germanic w between vowels becomes ggw. This
may happen sometimes in English, in phrases like bee wing.

2. Combinatory changes:

(a) Consonant clusters in which w is unchanged:

In Sanskrit m + w becomes nasalized w plus w. In English
this probably happens in sentences like "Give him one."

(b) Consonant clusters in which w is lost:

(1) In Latin, dw between vowels becomes  This probably
does not happen in English. If it does, it affects phrases
like add one.

(2) In Greek intervocalic lw becomes l, nw becomes n, and
rw becomes r. In some dialects the preceding vowel is
lengthened. This probably does not happen in English; if
it does, it affects phrases like sell one, in one, and or one.

(c) Consonant clusters in which w is completely assimilated:

In Germanic nw becomes nn. This probably does not happen
in English. If it does, it affects phrases like in one.


(d) Consonant clusters in which w and the other consonant are both
changed:

(1) In Latin, initial dw becomes b. If this happens, it affects
words like dwell.

(2) In Greek kw becomes ££ and gw becomes ^ before an a,
an o, or a consonant; kw becomes tt and gw becomes ^
before an i or an e; kw becomes k and gw becomes g before
or after u. We do not believe that these rules apply to English.


APPENDIX H. VI


SANDHI RULES OF SANSKRIT AND THEIR
APPLICATION TO ENGLISH


A. Phonetic Rules Relevant to English

1. When there are two or more consonants at the end of a word, the first
is retained and the others dropped. This sometimes happens in English;
act becomes ac and loft becomes lof.

2. Dental n coming after retroflex s or r, whether vocalic or consonantal,
in the same word is changed to lingual n. This change takes place
even if a vowel, a semivowel, h, or any guttural or labial consonant
comes between the r or retroflex s and dental n. This change does
not take place if dental n ends a word. In American English, the n
of internal has its place of articulation further back than an ordinary
alveolar n. The place of articulation may be palatal.

3. Lingual r followed by lingual r is dropped and the preceding vowel,
if short, is made long. In American English this may happen with
such words as or in phrases like or red.

4. Dental s following any vowel besides a or ā or following a guttural
or a consonantal r becomes a retroflex s. In American English the
s of lease may have a palatal rather than an alveolar place of
articulation.


5. When preceded by any stop consonant, h is changed to a voiced
aspirated stop having the same place of articulation as the preceding
consonant. If this happens in English, it affects phrases like black hat.
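
Rule 1 of this list (retain the first consonant of a word-final cluster and drop the rest) lends itself to a short sketch. The example assumes that ordinary spelling stands in for the phone string and that the five vowel letters are the only vowels, which is adequate only for examples such as act and loft; the function name is hypothetical.

```python
VOWELS = set("aeiou")


def reduce_final_cluster(word: str) -> str:
    """Keep only the first consonant of a word-final consonant cluster."""
    end = len(word)
    # Walk back over the final run of consonants.
    while end > 0 and word[end - 1] not in VOWELS:
        end -= 1
    cluster = word[end:]
    return word[:end] + cluster[:1]
```

With this sketch act becomes ac and loft becomes lof, while a word ending in a vowel or a single consonant is returned unchanged.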

B. Phonetic Rules Possibly Relevant to English

(We are not sure whether the sounds which we call "aspirated" in
English are articulated in the same manner as the Sanskrit aspirates.
If they are, then these Sanskrit rules may apply to English. We already
know that some final voiceless stops in English, such as the k of back,
are unaspirated. These rules may predict where.)

1. An aspirate stop or affricate is changed to a non-aspirate before another
stop or before a sibilant; it stands unaltered only before a vowel,
semivowel or nasal.

2. An aspirated stop becomes unaspirated in absolute final position (at
the end of a sentence).

C. Historic and Analogic Rules

1. c or j is changed to k before voiceless consonants and g before any
voiced consonant except a nasal or semivowel; this change also takes
place when the consonants end a word, even before a nasal or semivowel.
(This rule represents an historic survival. At an earlier stage in
the language, k and g became c and j in most phonetic environments,
and remained unchanged in certain positions. The list of environments
in which by sandhi rule c and j "become" k and g is really a list
of the environments in which k and g did not become c and j. The
alternation between k and g and that between c and j are explained
by the rules governing resonances in Appendix H. III.

This rule suggests that we should look for similar cases of
historic survival in English. At present we have no examples.)

2. Final n followed by a dental, palatal, or lingual stop becomes ns. The
stop remains unchanged. (We have already discussed this example in
the main body of our text.)

3. The endings as and ās are governed by the following special rules:

(a) When as is followed by a, it becomes o, and the following a is
dropped.

(b) When as is not followed by a, it becomes a and the resulting hiatus
remains.

(c) Before any voiced sound, ās loses its s and the resulting hiatus
remains.

(In Sanskrit most occurrences of final as were case-endings for
nouns. These Sanskrit rules suggest the possibility that some English
noun or verb endings, such as the possessive s, may have similar
rules governing their combination with the initial sounds of the following
words. At present, however, we know of no such rules.)

4. The following rules governing final r appear to be analogic:

(a) Before a pause, r becomes visarga.

(b) Before a voiceless stop, r may become visarga.

(c) Before a sibilant, r may become visarga.

(We consider these rules to be analogic in origin for two reasons:
first, we have no examples from any other language of r becoming an
h-like sound by regular phonetic change; and second, all the necessary
conditions for an analogy were present in Sanskrit. Since s became
r before a voiced sound, in many phonetic environments the s words
had the same endings as the r words, and the s words were much
more numerous. Under these circumstances, it is normal that the
s rules should be extended to the r words.

We have not yet discovered any cases in English which show a
similar alternation as a result of analogic change.)

5. Before a pause, s becomes visarga. If this happens in English, it
affects words like space when they are in absolute final position.

6. Before any initial voiceless stop, final s may become visarga. If this
happens in English, it affects phrases like space test.

7. Before an initial sibilant final s may become visarga. If this happens
in English, it affects phrases like space shot.

8. Before an initial sibilant, final s may become a sibilant identical to
the following one. If this happens in English, it affects phrases like
space shot.

9. After any vowel except a or ā (in other words, after any vowel except
one which has the sound quality of the first vowel of father, regardless
of whether it is long or short), s becomes r before any voiced sound
except r. If this happens in English, it affects phrases like space
investigation.

10. Before an initial sibilant, final r may become a sibilant identical to the
following one. If this happens in English, it affects phrases like more ships.


APPENDIX H. VII


A. Methods of Representing Euphonic Rules Symbolically

The rules shown below are in an environmental form. The central
phone or phones are those to which a particular transformation occurs.
Those separated from the central ones by outward-facing brackets are
the environment and represent the conditions under which a transformation
will apply. Thus n] t [s represents the phone t in the environment "before
s and after n." The result of a transformation is indicated by an arrow,
so that a rule n] t [s → would mean that t
drops between n and s, the blank space after the arrow meaning "no phone."

When a rule applies to some phone in more than one environment,
this may be indicated by placing each environment on a separate line.
For instance

    n] t [s
       t [s + t    →

would mean that t drops either when it is between n and s or when it
precedes an s followed by another t. The + notation means that one
phone follows another, in this case t follows s on the second line of the
rule.
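
The environmental form described above maps naturally onto a small data structure. The following is a minimal sketch, not the notation's official encoding: the class name, field names, and the choice to apply a rule to all matching positions at once are assumptions made for illustration. Each environment is a pair of phone tuples (what precedes, what follows), and an empty result stands for the blank after the arrow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    target: str            # the central phone
    environments: tuple    # (left, right) pairs of phone tuples; () = unspecified
    result: tuple = ()     # () stands for the blank after the arrow: "no phone"


def _fits(phones, i, left, right):
    """True if `left` immediately precedes and `right` immediately follows i."""
    if i < len(left):
        return False
    before = tuple(phones[i - len(left):i])
    after = tuple(phones[i + 1:i + 1 + len(right)])
    return before == left and after == right


def apply_rule(rule, phones):
    """Apply `rule` to every matching position of `phones` at once."""
    out = []
    for i, phone in enumerate(phones):
        if phone == rule.target and any(
            _fits(phones, i, left, right) for left, right in rule.environments
        ):
            out.extend(rule.result)
        else:
            out.append(phone)
    return out


# t drops between n and s, or before s followed by another t.
T_DROP = Rule(target="t", environments=((("n",), ("s",)), ((), ("s", "t"))))
```

For example, `apply_rule(T_DROP, list("prints"))` removes the t, while `list("last")` is returned unchanged because neither environment is met.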

It is often necessary to refer to whole classes of phones rather
than to single ones. The symbols C and V refer to the classes of
consonants and vowels. If particular characteristics are needed, these
are placed in parentheses after the class or phone symbol, as C (v)
for voiced consonants, n (D) for dental n, and so forth. (A complete
list of these abbreviations appears at the end of this section.) The
rule following:

    n] (D) [s
       (D) [s + C  →

means that any dental phone is dropped between n and s or before s
followed by any other consonant.
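
As an illustration only, a drop rule in environmental form might be applied mechanically as in the following Python sketch. It is not part of the report's formulation: the function names, the representation of a phone as a single letter, and the small consonant set are all assumptions made for the example, and only the class symbol C is handled.

```python
# Hypothetical sketch: apply "drop" rules written in environmental form.
# Phones are single letters; CONSONANTS and the helper names are assumptions.

CONSONANTS = set("bdfghjklmnprstvwyz")

def matches(symbol, phone):
    """A symbol matches a phone if it names that phone, or is the class C
    and the phone is a consonant."""
    if symbol == "C":
        return phone in CONSONANTS
    return symbol == phone

def drop(phones, center, environments):
    """Drop `center` wherever any (before, after) environment holds.

    `before` and `after` list the symbols that must immediately precede or
    follow the central phone; an empty list means "not specified".
    Environments are tested against the original sequence.
    """
    out = []
    for i, phone in enumerate(phones):
        dropped = False
        if phone == center:
            for before, after in environments:
                left = phones[max(0, i - len(before)):i]
                right = phones[i + 1:i + 1 + len(after)]
                if (len(left) == len(before) and len(right) == len(after)
                        and all(matches(s, p) for s, p in zip(before, left))
                        and all(matches(s, p) for s, p in zip(after, right))):
                    dropped = True
                    break
        if not dropped:
            out.append(phone)
    return out

# t drops between n and s, or before s followed by any consonant.
T_DROP = [(["n"], ["s"]), ([], ["s", "C"])]

print("".join(drop(list("rents"), "t", T_DROP)))  # rens
```

Each environment is a separate entry in the list, mirroring the convention of writing each environment on its own line of the rule.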

Since the dimensions of our model -- manner of articulation,
place of articulation, and resonances (aspiration and voicing) -- play a
special part, a three-position notation is often used. The three parts,
separated by commas, represent the three dimensions, respectively.
Thus (Af, Ld, -a+v) represents the phone v which is articulated as an
affricate, is articulated in the labiodental position, is unaspirated and
voiced. (See list of abbreviations.) This notation permits us several
conveniences. We may leave a position blank to indicate that any "value"
in that dimension is valid in the rule desired. We may omit the letters
a and v for aspiration and voicing, so that v might be written (Af, Ld,
- +), and any unaspirated, voiced phone could be written ( , , - +). We
may place one symbol above another to indicate several choices for
one dimension, so that

     Af  D
    (Sp, A,  )
     S

would mean any dental or alveolar affricate, spirant, or sibilant.
(See abbreviations.)
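
The three-position notation lends itself to a simple pattern test. The Python sketch below is illustrative only and assumes a representation that the report does not specify: a phone is a (manner, place, resonance) tuple, a blank position is written `None`, and stacked symbols are written as a set. The entry `T_PHONE` is a hypothetical example, not taken from the list of abbreviations.

```python
# Sketch under assumptions: phone = (manner, place, resonance);
# None = blank position, a set = several stacked choices.

def position_matches(pattern, value):
    if pattern is None:                         # blank: any value is valid
        return True
    if isinstance(pattern, (set, frozenset)):   # stacked symbols
        return value in pattern
    return pattern == value

def phone_matches(pattern, phone):
    """True if every position of the phone satisfies the pattern."""
    return all(position_matches(p, v) for p, v in zip(pattern, phone))

V_PHONE = ("Af", "Ld", "-+")   # v as described in the text: unaspirated, voiced
T_PHONE = ("St", "A", "--")    # hypothetical unaspirated, unvoiced alveolar stop

ANY_UNASPIRATED_VOICED = (None, None, "-+")
DENTAL_OR_ALVEOLAR_FRICTION = ({"Af", "Sp", "S"}, {"D", "A"}, None)

print(phone_matches(ANY_UNASPIRATED_VOICED, V_PHONE))       # True
print(phone_matches(DENTAL_OR_ALVEOLAR_FRICTION, T_PHONE))  # False
```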

Using either form of the parenthesis notation, we may further
economize on symbols by indicating several phones with shared
characteristics using a single parenthesis and placing the symbols one above
the other, as

    n
    t (D)

for dental n or dental t.

When an environment allows for several phones to precede the
same following set of phones, the shared phones may be written on
the same line with commas intervening, as n, s] t [s for

n  _t  £  or  £_t  £»  or  ”■*  t  j^*  meaning  any  of  the  five  environ¬ 
ments  [^8  ,  ®J  L®  '  ”]  »  J  ■*■  H: 

For convenience, when no preceding or following environment is specified,
the appropriate bracket may be omitted. No confusion should result, as
the brackets are always directed so as to contain the environmental phones
rather than the central phones. The symbols t [s, V then mean
t before s or a vowel.

A few special symbols are used. Superscripts +, ', " mean
respectively "lengthened", "high stress", "low stress", so that V+ is
a lengthened vowel, (v)" is any unstressed voiced phone. The letters
x, y, z inside parentheses are reserved as variables. The rule
(St) [(St, x, yz) → (St, x, yz) means that
a stop coming before another stop takes the same place of articulation,
aspiration, and voicing as the following stop. A rule (St, x, )
[(St, x, ) → means that a stop preceding one with the same
place of articulation is dropped. Subscripts are used to indicate identical
phones, so that C1] w [C1" means w between two identical
consonants, the second being unstressed. * and ** mean word-break
and end of sentence.




The following symbols are used in the rules:

1) Manner of Articulation

    St   Stop                   Lg   Lingual
    N    Nasal (nasalized)      R    Retroflex
    Af   Affricate              V    Visarga
    L    Lateral                Cs   Consonantal
    S    Sibilant               Ch   (Characteristics yet to be determined)

2) Place of Articulation

    Gl   Glottal                D    Dental
    G    Guttural               Ld   Labiodental
    P    Palatal                B    Bilabial
    A    Alveolar

3) Resonances

    +a   Aspirated              -a   Unaspirated
    +v   Voiced                 -v   Unvoiced

4) Others

    C    Consonant              "    Unstressed
    V    Vowel                  x    Variable characteristic
    *    Pause                  1    Identical phone
    **   Final Pause            +    Lengthened
                                '    Stressed



B. Symbolic Rules for Euphonic Combination

(Lower case letters following rule numbers refer to notes at the
end of the list.)


*  +  (+v) 


1  a 

^  ] 

*  [  h.  V 

2b 

1 - 1 

>1 

>1 

I _ 1 

«• 

4  a 

V  (ch)  j 

*  ^  V(ch) 

2  a 

i  J 

Tv 

L  - 

h 

2  c 

y] 

y  [V 

8_  (St,Xj  )J  h 


9. 

h  V 

20  c 

2(ch)  J 

v_{ch) 

11_  c 

V(ch)  j 

[  V(ch) 

22  d 

V  (ch) 

+  V(ch) 

C  (+T) 


r 

? 


(St,  X,  I-  !') 


y 

w 


V  (ch) 


1  77 


36 


37 


38 


t 

P  +  * 

y 


J 


Sj  1' 

[• 


39  a,f 


g 


41 


42 


43 


44 


45 


'16  C 


V  +  r 


V  +  r 


£ 

•  +  C 


nj  [r 

m  J  1,  r 

h  +  y 

(St,  D,  x)  +  y 

Si  [  £i 

]  2  [  • 


V  (V) 

V  (V) 


[AX,  P,  -x) 


I  80 


I 


NOTES: 


a. Part of a pause-dropping scheme not yet fully developed. We know
pause does not drop before bilabials.

b. New England dialect only.

c. y and w may add or subtract between vowels, but the exact conditions
are not well known. Acoustic data, some of which is included in this
report, may soon be able to further clarify these conditions.

d. Although we do not yet know the conditions for vowel coalescence,
this rule is a vital part of the scheme for handling vowels.

e. The condition (not 1) is spurious since we regard a "double" of
any phoneme other than a stop as a mere lengthening. (See also
rule  j(s) 

f. Some dialects omit the visarga.

g.  Southern  dialect  only. 




C. Relation of Present Symbolic Rules to Rules of Euphonic Combination
Previously Developed

The list following shows how the data presented in the April and
May reports were incorporated into the mathematical formulation of rules
for articulation. Rule numbers correspond to those in part B of this
Appendix. A plus sign indicates that the combined
effect of several rules must be used to achieve the result of the verbal
description. Abbreviations used are:

    na   Not applicable to English
    nt   Not true for English
    p    Partially used in . . . (rule number)
    c    Contradicts . . . (rule number)
    d    Doubtful validity
    al   May be added at a later time.


Appendix  II.  I. 

1,  (a)  nt 

•  2.(e) 

11 

3.  (c) 

al 

6,  (a) 

nt 

(b) 

nt 

(f) 

44 

(d) 

al 

(b) 

al 

(c) 

11 

(e) 

30 

(c) 

hi 

(c) 

al 

(cl) 

nt 

(b) 

29 

4.  (a) 

nt 

(d) 

al 

2.  (a) 

hi 

(i) 

na 

(b) 

d 

7.  (a) 

23 

(b) 

26 

U) 

na 

(c) 

al 

(b) 

(c) 

nt 

3.  (a) 

il 

5.  (a) 

al 

(c) 

1.1 

(d) 

nt 

(b) 

11 

(b) 

al 

I8<; 


Appendix  H.  Ill 


A.  1.  (a) 

1 

2.  (a) 

il 

2.  (d) 

Cb) 

ii 

(b) 

(b) 

il 

3. 

ii 

(c) 

il 

(c) 

j[7 

(c) 

il 

B.  1. (a) 

il 

Appendix  H,  IV 


A«  1.  (a) 

al 

5.(b) 

il 

9.  (a) 

37 

2.(c) 

b 

(b) 

al 

(c) 

nt 

(b> 

37' 

3.  (a) 

d 

(c) 

al 

(d) 

38 

10. 

32 

(b) 

d 

2. 

al 

(e) 

ci2,  ^ 

B.  1. (a) 

nt 

(c) 

d 

3.  (a) 

26 

6.  (a) 

33 

(b) 

11 

4.  (a) 

d 

(b) 

nt 

(b) 

33 

(c) 

p34' 

(b) 

d 

4.  (a) 

e.33 

(c) 

nt 

(d) 

p  34 

(c) 

d 

(b) 

nt 

7. 

11 

(e) 

nt 

(d) 

7 

(c) 

nt 

8.  (a) 

al 

2.  (a) 

6 

5. 

d 

5.  (a) 

c  44 

(b) 

al 

(b) 

]_ 

Appendix  il,  VI 


A.i. 

'1 6 

n.i. 

il 

(b) 

ixa 

5. 

15 

2. 

2. 

ii 

(c) 

na 

6. 

il 

3. 

11 

ca. 

d 

4.  (a) 

al 

7. 

4. 

24 

2. 

na 

(b) 

al 

8. 

11 

5. 

8 

3.  (a) 

na 

(c) 

al 

9. 

il 

10. 

25 

183 


Appendix  H.  II 


A,l.(a) 

d 

2.  (a) 

al 

(e)p)  d 

(d)P)  d 

(b) 

d 

(b) 

d 

2,  (a)p)  ^ 

P)  al 

(o) 

d 

(c) 

d 

p)  p36' 

(4)  d 

(d) 

al 

(d) 

al 

(b)p)  nt 

p)  35 

2.  (a) 

nt 

(e) 

al 

P)  d 

P)135 

(b) 

nt 

(f) 

al 

P)  na 

(7)  35 

B.l. 

d 

D.l.(a) 

d 

(4)  al 

P)  35 

2. 

d 

(b) 

nt 

P)  36 

P)  36 

C,l.(a) 

d 

(c) 

d 

(c)P) 

(e)  d 

(b) 

pl7 

(d) 

nt 

p)  42 

(c) 

nt 

(e)P)  d 

(d)(1)  d 

Appendix  H.  V.  A 

A.1.  (a)a)  d 

2,  (a)P)  nt 

B.l. (a)  ^ 

(b)P) 

p)  6  & 

P)  nt 

(b)  3 

P)  c£5 

(b)(1)  d 

(c)a)  al 

C.l. 

P  19, 

p)  nt 

(4)  nt 

p)  nt 

2. 

il 

0  nt 

(b)&)  d 

P)  nt 

D.l. 

34 

(i)  nt 

P)  d 

2.  (a)P)  42 

2. 

li 

0  43 

(3)  d 

P)  41 

3. 

4 

Appendix  IV  V,  13. 

All,  (a)(1)  na  ^)  c£I_  (3)  pX 

(1)  na  2,  (a)  d  (b)(1)  d 

(3)  na  (b)  d  p)  ^ 

(4)  na  B,  U  (a)(1)  d,  c  ^  (?)  d 

(b)(1)  na  P)  p2_  2,  (a)  2^ 


(b) (1)  d 
p)  d 

(c)  d 

(d) (1)  d 
p)  d 




APPENDIX H. VIII


RULES OF SOUND SHIFT DERIVED FROM MARTINET'S THEORY
OF MINIMUM EXERTION IN ARTICULATION
FOR POSSIBLE USE IN OUR MODEL

1. Before or after a dental, an alveolar may become a dental, as in
health and width.

2,  Before  or  after  a  labial  or  labiodental,  an  alveolar  (except  ^  may 
become  a  dental,  as  in  apt  and  at  peace. 

3. Before or after a guttural, an alveolar may become a palatal in words
like books and act.

4. Before or after a labial or a labiodental, a palatal may become alveolar.
This occurs in phrases like ash bin.

5. Before or after a labiodental or an alveolar, a palatal hushing sibilant
may become alveolar. This occurs in phrases like red shoes.

6. l is basically an alveolar consonant, but before or after a back vowel,
a palatal consonant, or a guttural consonant, it normally takes the place
of articulation of the adjacent phone; it may alternately have its place
of articulation at the alveolar ridge, or between the alveolar ridge
and its normal place of articulation if the speaker is concerned about
clarity.

7. Before a labiodental, a labial may become labiodental. This occurs in
phrases like am fine and clip four.

8. After an r, an affricate may become an alveolar. In church, the second
[č] is farther forward than the first.




APPENDIX H. IX

FURTHER RULES OF SOUND CHANGE

(This Appendix includes rules which have not previously been
discussed or listed, rules which have been discussed in the text but
which have not been listed in previous sections of this Appendix, and rules
which have been revised to agree with acoustic data studied.)

1. A final consonant of one word may be attached to an initial phone of the
following word, as in phrases like make-up and made of. (See discussion
on page 58 of the text of this report.)

2. When a nasal is followed by a voiceless sibilant or spirant, a voiceless
stop may be inserted between them; mince may be pronounced mints.
(This rule has not been discussed previously.)

On page 40 of the text, we noted that sometimes "the t almost
disappears from rents." This example seems to contradict this rule,
since according to the rule, a t would probably be inserted in such a
phonetic context. Truby has shown (Truby, 1959, p. 206) that the words
prince and prints are homonyms; sometimes there is a stop-gap present
and sometimes not, but prince has a stop-gap as often as prints. The
same thing is true of lens and lends.

3. A voiced consonant in a voiced environment may become voiceless and
sometimes aspirated; this may affect words like rouge and begin. (This
rule has not been discussed previously.)




4. A vowel may be replaced by a visarga vowel, as in the New England
pronunciation of words like car and yard. (See discussions on page 58 of
the text and in Appendix I. In Appendix H. VI (p. 171) visarga is referred
to as "analogic" in nature -- the acoustic data definitely indicate that such
sound changes occur in English.)

5. Final d may drop out after n, as in words like land. (This rule has not
been discussed previously.)

6. A glottal stop may occur before any initial vowel, as in words like one,
at, and among. (This rule was originally included in Appendix H. V, page
164, but it is being revised here to conform to evidence found in the data
studied.)

7. An alveolar stop between two consonants may drop out. (In Appendix H. IV,
we said that a t may drop between two s's. We are here enlarging the scope
of this rule.)

8. An alveolar stop before an affricate may drop out. (This rule has not
been discussed previously.)

9. When two identical consonants come together, they form one long
consonant, that is, a consonant in which one part is considerably longer than
normal. (In the case of stop consonants this long part is the stop gap; in
the case of spirants and sibilants it is the fricative portion.) This long
consonant may be shortened to the normal length of a single consonant.

This rule applies to any combination of identical consonants,
regardless of whether both these consonants are normal in the particular
words involved or whether one is the product of another rule of euphonic
combination. (This was discussed in Section 3, page 52 of the text, but
it was not included in the rules.)

10. In some New England dialects an r after a vowel is dropped except when
r stands at the end of a word and the next word begins with a vowel. (This
was discussed on page 57 of the text but was not included as a rule.)

11. In some New England dialects the r may be inserted between a word
ending with a vowel and a word beginning with a vowel. (This was discussed
on page 57 of the text, but was not included as a rule.)




APPENDIX  I 


VISARGA  VOWELS 

Visarga is a class of vowels defined by Sanskrit phoneticians
as sounds that are co-articulations of vowels with h or aspiration.

This class of vowels is not mentioned among most European
vowel sounds, and it has not been referred to in acoustic phonetic
studies. The commonly used methods of phonemic analysis of
languages, described in our previous report, probably do not yield
such phonemes in carefully articulated speech of the languages
referred to.

However, if the freedoms in the articulation of speech make
it possible to generate such sounds, then they could occasionally
occur in continuous speech of one or more of these languages, even
if the presence of visarga is not formally recognized for these. For
this reason, the generation of visarga and the expected acoustic
characteristics of its waveforms are considered next.

For generation of visarga, the positions of cheeks, tongue and lips
are the same as those for the vowel of its form. The only difference
between a visarga vowel and a normal vowel is that there is a steady
stream of air flow from the vocal cords for the former as opposed to
significant periodic interruption of such a flow for the latter. This is
illustrated in Figure 75.

Since the output of the vocal flaps passes through a mouth cavity
whose configuration is similar for both types of vowels,
the formant frequency levels are expected to be about the same for
both these types of vowels. However, since there is some difference
in the spectral characteristics of the sounds of these two types of
vowels, there should be some difference in the formant frequency
levels also.

The presence of flow of air, as in the articulation of h, should
produce additional frictional energy in the case of visarga vowels, whereas
such energy is absent in normal vowel articulation.

With the preceding discussion of visarga vowels it is worth
considering the possibility of their occurrence in the English language.

When one considers an expression ending in a sigh or an exclamation
indicating relief, there is a distinct possibility of generation
of visarga vowels.

Another possible occurrence of visarga vowels could be the vowels
in the Boston accent that precede an ending r, such as a in car; but
this aspect needs to be substantiated.

Since the freedoms in the methods of speech generation indicate
the possibility of production of visarga vowels, and since the rules of
euphonic combination indicate the possibility of their occurrence in
English conversation, these vowels are added along the vowel dimension
of the model.

For conditions of production of visarga, described above, the
vocal flaps can be considered to be partly open at all times and also
in oscillating movement that results in modulation of the air stream.
Such a source of acoustic energy essentially produces a spectral pattern
such as illustrated in Figure 76.




APPENDIX  J 


THE  IMPORTANCE  OF  THE  VOCODER  IN 
ACOUSTIC  AND  PHONETIC  RESEARCH 


One  of  the  important  tools  developed  and  used  in  modern  acoustics 
is  the  "vocoder",  originally  conceived  at  the  beginning  of  World  War  II. 

The  primary  aims  of  its  development  were  security  of  communication 
and  reduction  of  bandwidth  needed  for  transmission  of  speech. 

To accomplish its effects the vocoder uses a bank of band-pass
filters that divide the speech wave into several frequency bands;
each filter measures the energy concentration within its given range of
frequency, and each filter emits a single wave which represents the
composite energy of all sound waves of the original speech that fall within
the band-pass filter range. Receiving equipment picks up the several
waves representing energy and uses them to control the output of several
oscillators, each assigned to a separate filter at the sending end. By
combining the output of these oscillators into one wave and
feeding this wave into a loudspeaker one can create a reasonable
approximation of original speech. It should be noted that most vocoders and
their adaptations measure speech frequency up to 4000 cycles per second
although the actual speech wave has a range of 10,000 cycles per second
or higher. The reason for this is that telephone networks had already
demonstrated the possibility of transmitting recognizable gross
characteristics of the human voice within a range not exceeding 4000 cycles.
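
The analysis and resynthesis steps just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any actual vocoder: the band energies are measured with a direct discrete Fourier transform rather than analogue filters, one sine oscillator is placed at the centre of each band, and the function names, band edges, and sampling rate are all chosen for the example.

```python
import math

def band_energies(frame, sample_rate, bands):
    """Mean power of `frame` in each (low_hz, high_hz) band (direct DFT)."""
    n = len(frame)
    energies = [0.0] * len(bands)
    for k in range(1, n // 2):
        freq = k * sample_rate / n
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        im = sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        power = (re * re + im * im) / (n * n)
        for b, (lo, hi) in enumerate(bands):
            if lo <= freq < hi:
                energies[b] += power
    return energies

def resynthesize(energies, sample_rate, bands, n):
    """One oscillator per band at the band centre, level set by band energy."""
    out = [0.0] * n
    for energy, (lo, hi) in zip(energies, bands):
        centre = (lo + hi) / 2
        amp = 2 * math.sqrt(energy)   # amplitude of a sine with that mean power
        for t in range(n):
            out[t] += amp * math.sin(2 * math.pi * centre * t / sample_rate)
    return out

# Assumed test signal: a 500 cps tone sampled 8000 times per second.
RATE, N = 8000, 160
BANDS = [(0, 1000), (1000, 2000), (2000, 3000), (3000, 4000)]
tone = [math.sin(2 * math.pi * 500 * t / RATE) for t in range(N)]
energies = band_energies(tone, RATE, BANDS)
approx = resynthesize(energies, RATE, BANDS, N)
```

Only the four band energies pass from the sending side to the receiving side, which is the source of the reduction in transmitted information; everything finer than the band is lost.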

Immediate uses of the vocoders were twofold. By "quantizing"
energy of speech in accordance with the bandwidths of the filters, the
vocoder reduced the amount of electronic information necessary for
transmission of speech by a ratio of about 10:1. At the same time the
use of filters and oscillators made it possible to "scramble" transmitted
information by rapidly alternating the combinations of carrier frequencies
assigned to filters as well as to their respective receiving oscillators.

The first machine to utilize the process just described was later
identified as the analogue channel vocoder to distinguish it from later
adaptations discussed below. It differs from such adaptations in that it
transmits information about the energy output of each filter continuously.

At present, there continues a discussion about the ideal bandwidth
for vocoder filters, as well as the attenuation characteristics of the
filters' "skirts." The problem is particularly important in the meaningful
identification of information-bearing elements of the speech wave
either in speech transmission or recognition, considered in the following
section. The greater the filter bandwidth, the less the resolution in the
sound-wave energy; the smaller the filter bandwidth, however, the less
possible it is to identify gross characteristics of the speech wave from
energy in the harmonics of vocal flap frequency. The current trend is
towards using filter bandwidths from 50 to 400 cycles per second and
having "skirts" with a gradual (such as about 12 db per octave) rather
than a steep slope.

Such problems are the subject of a paper by C. G. M. Fant
("Acoustic analysis and synthesis of speech with applications to Swedish",
Ericsson Technics 15, No. 1 (1959) 3-108), and of recent work at RCA
(on contract with WADD). An alternative approach to the passing of
speech through a bank of band-pass filters has been developed by the
Federal Scientific (RADC contract No. AF 30(602)-1615). Since the
subject of transformed waves is highly specialized, we refer the readers
particularly interested in this subject to these papers.

An early modification of the analogue vocoder is the digitized
vocoder. It is essentially the same as an analogue channel vocoder, but
the energy output of the filter bank is sampled periodically and the level
of this energy at sampling time is transmitted by pulses.
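
The periodic sampling that distinguishes the digitized vocoder can be sketched as follows. The Python fragment is a hedged illustration only: the number of quantizing levels, the sampling interval, and the function name are assumptions, and the integer codes stand in for whatever pulse representation a given machine used.

```python
def digitize_channels(energy_tracks, sample_every, levels, full_scale):
    """Sample each channel's energy track every `sample_every` steps and
    quantize the sampled level to an integer code in 0 .. levels-1."""
    codes = []
    for t in range(0, len(energy_tracks[0]), sample_every):
        frame = []
        for track in energy_tracks:
            clipped = min(max(track[t], 0.0), full_scale)
            frame.append(round(clipped / full_scale * (levels - 1)))
        codes.append(frame)
    return codes

# Two hypothetical channels, sampled at every second step, eight levels each.
tracks = [
    [0.0, 0.1, 0.4, 0.9, 1.0, 1.2],
    [1.0, 1.0, 0.0, 0.0, 0.6, 0.5],
]
print(digitize_channels(tracks, sample_every=2, levels=8, full_scale=1.0))
# [[0, 7], [3, 0], [7, 4]]
```

Values between sampling instants are never transmitted, in contrast to the analogue channel vocoder, which reports each filter output continuously.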

Later adaptations of the digital vocoder had two primary objectives.

One goal was to transcribe more precise information about the
information-bearing characteristics of speech; this need also contributed
to the exact classification of phone groups and formants. It also produced
the formant vocoders, which only use three or four filters to break up
the entire speech wave. The second goal was to reduce significantly
the rate of information transmission needed for speech communication;
such needs led to the development of Caldwell Smith's modified vocoder,
and it also gave an impetus to research into the exact acoustic classification
of various phone groups. These efforts are discussed below.

Development of vocoding techniques gave scientists their first
incentive to measure different spectral density distributions at different
intervals of time. A particular impetus for this work was based on
the differences in spectral density which seemed to be directly related to
differences in sound that could be identified by ear. Such work led to
the development of the spectrograph. A spectrograph is a special tool
developed for a careful study of spectral density distributions of speech
waves.

The spectrograph enabled scientists to produce graphic
illustrations of formants as functions of the words articulated and of time.
Formants are the main regions of peaks of spectral density envelopes. The
voiced portions of speech waves contain about three formants that are
considered to convey significant information. These correspond to the three
principal resonance chambers formed by various couplings of the throat
and mouth cavities.

Conceptually, it should be noted that the precise characterization
of formants is still a subject of discussion. In classic acoustics a formant
is defined as the peak of the envelope of the spectral density distribution,
but difficulty in determining these peaks precisely has led to inaccuracies
of measurement. As an operational definition Peterson and Barney, whose
work is discussed later in this section, have suggested applying the term
formant to the center of gravity of the spectral density in the regions of
three decibels on either side of the density peak. This still leaves open the
situation where the peaks are close together and the regions do not show
a 3 db depression in the spectral density level. Graphically, the formant
appears on a spectrogram as a dark band representing concentration of
energy whose frequency level varies with time. This band may be
divided into transitions (onglide and offglide) caused by movement of the
articulatory organs from one sound to another and the production of a
specific tone (steady state). Specific investigation into the nature of
formants has been particularly oriented toward identifying phones by
the characteristic slope of their transitions and the level of their steady
states.
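
One reading of the operational definition above can be made concrete with a short Python sketch. The interpretation is an assumption: the "region of three decibels on either side of the peak" is taken to be the contiguous run of spectral samples whose density lies within 3 db of the largest peak, and the formant is the density-weighted mean frequency over that run. The function name and the sample spectrum are hypothetical.

```python
def formant_center_of_gravity(freqs, power, width_db=3.0):
    """Centroid of the spectral density over the contiguous region lying
    within `width_db` of the largest peak (an assumed interpretation)."""
    peak = max(range(len(power)), key=lambda i: power[i])
    floor = power[peak] / (10 ** (width_db / 10))
    lo = peak
    while lo > 0 and power[lo - 1] >= floor:
        lo -= 1
    hi = peak
    while hi < len(power) - 1 and power[hi + 1] >= floor:
        hi += 1
    total = sum(power[lo:hi + 1])
    weighted = sum(f * p for f, p in zip(freqs[lo:hi + 1], power[lo:hi + 1]))
    return weighted / total

# Hypothetical spectral samples around a single peak (frequencies in cps).
freqs = [400, 500, 600, 700, 800]
power = [1.0, 6.0, 10.0, 8.0, 2.0]
estimate = formant_center_of_gravity(freqs, power)
```

With these numbers the peak sample lies at 600 cps, but the centroid falls slightly above it because the density is skewed toward 700 cps. When two peaks lie close enough that the density never falls 3 db between them, the region spans both and the sketch returns a single merged value, which is the open case noted in the text.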

The first extensive published results of spectrographic analysis
of formant energy distribution are reported in the textbook Visible
Speech by Potter, Kopp, and Green. In this study the spectral densities
of different speech wave-forms are displayed as functions of time. On
the spectrograms each speech wave usually reveals from two to five
distinct formants. Spectrographic transcriptions of speech were made
and the transcriptions classified both according to the identities of the
speakers and according to the sentences spoken for producing the wave
forms.

Human observers, it was discovered, could actually read these
patterns as whole sentences and even relate fragmented portions of the
sound wave patterns to those portions of the spoken sentences that
produced them. This was true even when the sentences were spoken by a
variety of different speakers selected for the experiment.

The ability of speakers to recognize sounds by their gross energy
distributions alone suggested the value of further investigation into the
nature of such distribution, such as the classification of phones, discussed
below. One of the objectives of such work was the reduction in rate of
information transmission for speech communication and the other was
development of speech actuated machinery.

With  digitized  vocoders  it  is  necessary  to  transmit  information 
about  energy  levels  at  approximately  twenty  frequency  ranges.  If  one 
could  identify  sounds  through  gross  characteristics  of  their  formants 
the  number  of  ranges  about  which  information  need  be  transmitted 
might  be  reduced  to  three  energy  bands  and  the  identification  of  pitch. 

This  possibility  was  indicated  by  spectrographic  studies  of  speech  wherein 
formants  were  indicative  of  differences  between  the  various  sounds  in 
speech. 

Drs. Peterson and Barney of Bell Telephone Laboratories have
also investigated the characteristics of steady state formants as identifiers
of vowels of English.

The experimental procedure of this research was to analyze the
spectral densities of specified English vowel sounds positioned between
the consonant sounds "h" and "d." Experimenters used phonetic data
from the enunciations of 75 select and trained speakers. Researchers
took pains to obtain very careful enunciation; each speaker's vowel
sounds were tested on a random audience before and after spectrographic
recording to determine whether the sound enunciated was identifiable
as a specific vowel.

Although the relevancy of experiments with careful articulation
to the transcription of general rapid speech is still somewhat questionable,
the Peterson-Barney data has indicated that levels of first and second
formant frequencies gave reasonable indication of about 90% of the vowel
sounds studied; the data also indicated formant overlap in regions
representing two or more vowels. Such overlaps generally resulted from the
different speech characteristics of different subjects. The experiments
of Peterson and Barney thus indicate that it may be possible to
represent steady-state vowel sounds solely by graphing the levels of
formant frequency.
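
The idea of identifying a steady-state vowel from its first two formant levels can be illustrated by a nearest-centroid rule, as in the Python sketch below. This is not the procedure of Peterson and Barney: the classifier, the function name, and the three centroid values are assumptions made for the example, and the centroids are illustrative round numbers rather than figures taken from this report.

```python
import math

# Illustrative (F1, F2) centroids in cycles per second. These are assumed
# values for the sketch, not measurements reported in this document.
CENTROIDS = {
    "i": (270.0, 2290.0),
    "a": (730.0, 1090.0),
    "u": (300.0, 870.0),
}

def classify_vowel(f1, f2, centroids=CENTROIDS):
    """Return the vowel whose (F1, F2) centroid lies nearest the input."""
    return min(
        centroids,
        key=lambda v: math.hypot(f1 - centroids[v][0], f2 - centroids[v][1]),
    )

print(classify_vowel(280.0, 2200.0))  # i
print(classify_vowel(700.0, 1100.0))  # a
```

A rule of this kind fails exactly where the text reports overlap: when the regions occupied by two vowels intersect across speakers, a point in the intersection is assigned to whichever centroid happens to be nearer.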

Studies of spectrograms such as those published in Visible Speech
moreover indicate the possibility of representing consonant sounds by
noise bursts preceding or following the vowel sounds, then adding a
suitable transient to the formant levels between the consonant noise burst
and the vowel steady state.

Moreover, the possibility of transmitting information about
formants only was further substantiated by study of stylized patterns
of speech spectrograms at Haskins Laboratories and synthesized speech
produced by their Pattern Playback. (Researchers produced these
patterns by painting formant patterns on celluloid sheets graphed to
measure frequency levels. When the celluloid passed across the
Playback beams, the light reflected from the paint actuated the production
of tones at various levels of frequency. Simulated sounds so produced
were presented to human listeners for evaluation of the "quality" of
the synthetic representation of phonemes studied for any test.)

By experimenting with the shape of formant patterns, then
observing the resultant change in sound, Haskins Laboratories were able to
generate a considerable amount of valuable information defining the exact
phonetic changes produced by shifts in sound-wave energy. By changing
the frequency and duration of onglides and offglides of vowel and
consonant formants, for example, researchers found they could produce sounds
similar to semi-vowels. Among other phonetic-acoustic characteristics
about speech perception that were formalized by Haskins Laboratories
are those of stop consonants and fricatives.

Such initial success in relating acoustic information about
formants to phonetic perception led to two parallel efforts -- development
of formant vocoders, and precise classification of phone groups.

Development of formant vocoders, using only three or four
filters to track formants, was motivated by the desire to reduce the
rate of information transmission for speech communication, as has been
previously noted.

Refinements of formant vocoders have also contributed to the
more efficient transmission of speech. Early formant vocoders operated
by transmitting information about those filters which had spectral density
maxima in their frequency range. The shifts in formant frequency
with time were indicated only when such a change represented the shift
of the maximum from the frequency range of one filter to that of an
adjoining one. Hence the formant frequency changes were quantized
according to the frequencies of the filters in the bank.
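
The quantization just described can be illustrated with a short sketch.
It is not taken from any of the vocoders discussed: the filter center
frequencies, the frame energies, and the function names are assumptions
chosen only for illustration. The tracker reports a formant as the center
frequency of whichever filter holds the spectral maximum, and signals a
change only when the maximum moves to a different filter.

```python
from typing import List, Sequence, Tuple

# Hypothetical center frequencies (cycles per second) of a small filter bank.
FILTER_CENTERS = [300, 600, 900, 1200, 1500, 1800, 2100, 2400]


def peak_filter(energies: Sequence[float]) -> int:
    """Index of the filter whose output energy is largest in this frame."""
    return max(range(len(energies)), key=lambda i: energies[i])


def quantized_track(frames: Sequence[Sequence[float]]) -> List[Tuple[int, int]]:
    """Return (frame_index, center_frequency) each time the peak changes filter.

    Movement of the true formant inside one filter's band produces no event,
    which is the quantization effect described in the text.
    """
    events: List[Tuple[int, int]] = []
    previous = None
    for t, energies in enumerate(frames):
        current = peak_filter(energies)
        if current != previous:
            events.append((t, FILTER_CENTERS[current]))
            previous = current
    return events
```

A formant gliding upward inside the band of one filter is invisible to
this scheme until it crosses into the neighboring band.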

New heterodyne filters aim to control automatically the filter center
frequency to correspond to the peak of the envelope of spectral density
output, and also to incorporate automatic methods which would switch
transmission from one filter to the next as necessitated by the movement in
frequency of maxima of spectral density output with time. While
eliminating distortion introduced by quantization, such improvements also
present a method for automatic measurement of formant frequency,
information essential to automatic "phoneme" recognition using spectral
density output. There is still room for improvement in the reliability
with which formant tracking is accomplished by such methods.




Precise classification of phone groups, the other major field of
investigation related to spectral analysis, provided even further
opportunities for reducing the rate of transmitting information. Earlier
work of linguists had already presented the possibility of specifying the
speech of any Indo-European language by a small set of elemental sounds
called phonemes. English is said to have between 35 and 42 phonemes.
Assuming that phonemes are recognizable by certain gross
characteristics, much as printed letters may be recognizable in human
handwriting however distorted, it should be possible to build a machine to
recognize these gross characteristics. Although an exaggeration, the
metaphor provides a partial analogy to the initial thought processes
that led to continuing research into the precise identification of phonetic
sounds by their acoustic characteristics.

If such identification were feasible, the amount of information
transmitted about speech would be reduced even more than through
formant vocoders, since it would be possible to assign a code number to
each phone and transmit only that code to actuate a receiver; in such a
case the number of pulses or bits needed for transmission is said to be
about twenty-four per second, compared with over 600 per second for
digitized formant vocoders such as described in J. Flanagan's doctoral
dissertation.
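
The arithmetic behind these figures can be checked with a short sketch.
It assumes a fixed-length binary code for each phone, which the text does
not specify; the function names are hypothetical. Under that assumption a
set of 35 to 42 phonemes needs six bits per code, so a rate of about
twenty-four bits per second corresponds to about four codes per second,
and the quoted rates differ by a factor of about twenty-five.

```python
def bits_per_code(alphabet_size: int) -> int:
    """Bits needed for a fixed-length binary code over `alphabet_size` symbols."""
    if alphabet_size < 1:
        raise ValueError("alphabet_size must be positive")
    return (alphabet_size - 1).bit_length()


def codes_per_second(bit_rate: float, alphabet_size: int) -> float:
    """Number of fixed-length codes that fit into `bit_rate` bits per second."""
    return bit_rate / bits_per_code(alphabet_size)


def rate_ratio(vocoder_bit_rate: float, phone_code_bit_rate: float) -> float:
    """How many times larger the vocoder rate is than the phone-code rate."""
    return vocoder_bit_rate / phone_code_bit_rate


if __name__ == "__main__":
    for size in (35, 42):
        print(size, "phonemes ->", bits_per_code(size), "bits per code")
    print("codes per second at 24 bits/s:", codes_per_second(24, 42))
    print("600 bits/s versus 24 bits/s:", rate_ratio(600, 24))
```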

As yet, however, it has not been possible to discover a completely
adequate method for grouping phones according to their acoustic data,
despite the initial success of some limited purpose digit recognizers
discussed in the following appendix. One reason for this is that the
characteristics of each phoneme are modified by preceding and following
phonemes, so that one phoneme group is likely to consist of several
allophones, all sharing common characteristics of one sound, but each
slightly different from its associates.

The analysis of speech into its possible allophones would in fact
result in so large a number of possible sounds that the complete study
of all their acoustic correlates would require considerably more
time than that which has already been spent studying phone classes.
Considering the large number of allophones, it is indeed almost
impossible to identify a separate phoneme by its acoustic patterns alone,
particularly for the transmission of speech. Thus hampered, efforts
at reducing speech transmission bandwidth through automatic phoneme
recognition have not completely achieved their goal. Specific problems
in phone grouping, more directly related to the practical development
of speech recognizers, are discussed in the following appendix.




Faced with many unsolved problems in relation to the grouping
of phones, scientists have also worked on alternative methods of
transmitting speech efficiently. One solution to the problem of phone grouping
is to gather information about the speech wave at regular intervals of
time without relation to phonemic segmentation. This is an extension of
the technique of digitized vocoders previously mentioned; Caldwell
Smith's modified vocoder presents a method of speech transmission
that is considerably more efficient than most other methods that are
ready for practical development.

In the vocoder developed by Smith, digital samples of quantized
energy levels from each filter are fed to a temporary memory bank
in a set frame instead of directly to a transmitter as in a digital
channel vocoder. The frame in the temporary memory bank, representing
the orderly output of each filter, is compared with other such frames
stored in the permanent memory bank. The frame in the permanent
memory bank that best matches the actual sample is transmitted to
the receiver by its representative code.

At the receiver there is an identical set of stored frames
representing possible digitized combinations of spectral energy; when
it receives the coded signal, the receiver chooses the proper frame.
The orderly stored representation of energies on the chosen frame then
actuates the bank of oscillators, just as if the information about the
total frame had been received as in a digital channel vocoder.

Although this process involves more steps than in the regular
digitized vocoders, it eliminates the necessity of transmission of a
large number of bits, since after assigning a code to the whole frame of
digitized representations of energy levels, it is necessary to transmit
only this code for the entire frame rather than all the energy levels of
the various filters.
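
The frame-matching scheme described above can be sketched in a few
lines. This is an illustration under stated assumptions, not a
description of Smith's equipment: the text says only that the stored
frame that "best matches" is chosen, so the squared-difference measure,
the tiny three-filter codebook, and all names below are hypothetical.

```python
from typing import List, Sequence

Frame = Sequence[int]  # one quantized energy level per filter


def mismatch(a: Frame, b: Frame) -> int:
    """Assumed matching measure: sum of squared level differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode(frame: Frame, codebook: Sequence[Frame]) -> int:
    """Transmitter: return the code (index) of the best-matching stored frame."""
    return min(range(len(codebook)), key=lambda i: mismatch(frame, codebook[i]))


def decode(code: int, codebook: Sequence[Frame]) -> List[int]:
    """Receiver: look up the stored frame that drives the oscillator bank."""
    return list(codebook[code])


def bits_direct(n_filters: int, n_levels: int) -> int:
    """Bits per frame when every filter level is sent separately."""
    return n_filters * (n_levels - 1).bit_length()


def bits_coded(codebook_size: int) -> int:
    """Bits per frame when only the code of the stored frame is sent."""
    return (codebook_size - 1).bit_length()
```

With, say, twenty filters of eight levels each, sending every level
costs 60 bits per frame, whereas a code into a 1024-frame store costs
10 bits; these particular sizes are illustrative only.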

Speech processed by vocoders is often considered to lack
"naturalness" of a speaker's "voice qualities." This is a result of
synthesis of speech at the receiver from information about energy
outputs of about twenty filters that divide the speech spectrum into a like
number of frequency bands. Some effort has been made towards
overcoming the above-mentioned limitation. In the beginning, information
about the speaker's vocal flap frequency and its harmonics was presented
along with the spectral density output of the band-pass filters. More
recent devices, namely the voice excited vocoders developed at the
Bell Telephone Laboratories, transmit low frequencies (i.e. those
below one thousand cycles per second) directly (without vocoding), and
the rest of the speech energy is vocoded for transmission purposes.
Such a system is reported to retain several of the "speaker's" voice
qualities.

As is apparent from the review above, the development of
equipment for speech recognition has depended primarily on the efficient
transmission of a recognizable sound-wave. Such investigation has
produced a considerable amount of data useful in terms of a general purpose
recognizer; at the same time certain aspects not directly relevant
to efficient transmission of the speech wave have not been given much
attention, and deserve further investigation for the uses of our model.
There is, for instance, some question about the distortion caused by
passing a speech wave through a bank of band-pass filters; for purposes
of speech recognition our model may require additional data in the form
of time-amplitude plots. The following Appendix considers some of the
problems in the application of present equipment to the development of
general speech recognizers.




APPENDIX  K 

THE  DEVELOPMENT  OF  MACHINES  FOR  SPEECH  RECOGNITION 


Researchers investigating machinery for speech recognition relied
heavily on techniques developed for vocoders, as has been mentioned; at
the same time there was a primary difference in specific objectives
between those wishing to reduce the amount of information needed for
transmitting speech and those whose main interest lay in using available
acoustic equipment to provide cues for speech transcription. Thus, while
experiments in automatic transcription of speech are by no means independent
of the methods and data employed in relation to vocoding technique, the
goal of speech recognition should be kept separate both conceptually and
procedurally.

The relationships and differences between the objectives of speech
transmission and speech transcription help explain the state of our present
data, for example. Construction of speech recognizers depends on the
ability of electronic machines to "read" speech from the gross
characteristics of sound-wave patterns. At present such patterns exist primarily
in the form of spectrograms and time-amplitude plots. Spectrograms have
a definite value in developing efficient speech band-width compression,
since they form the major part of our acoustic information drawn from
previous research.

There is a more direct coincidence of interests between speech
transcription and transmission in the matter of grouping sounds for
automatic recognition, although the exigencies of speech transcription
may demand a more detailed analysis of suitable phone classification,
segmentation, and normalization of duration than is presently provided
by research related to vocoders. The probability of such conflicts should
be kept in mind in the following discussion of actual experiments with
speech recognizers. Initial efforts with speech recognizers depended on
acoustic information available from spectrograms; sounds were related to
the acoustic patterns they produced.

One of the first efforts to develop speech actuated machinery
was the automatic digit recognizer of the Bell Telephone Laboratories,
known as AUDREY. This device correlated the significant patterns of
spoken digits, such as frequencies and durations of noise bursts and
frequencies of formants. The success of such a recognizer was limited
to about 65% accuracy, even when allowances were made for variation in
the average formant frequencies of male and female speakers.

A more complex effort to transcribe speech was the automatic
typewriter conceived by Dr. Olson of RCA Laboratories. For this
development the information about location and duration of noise bursts,
the average levels of steady state formants, as well as the average transition
of formant frequencies between noise bursts and steady state formants,
were programmed for recognition of phonemes of a preselected vocabulary.
The performance of this typewriter is much harder to evaluate than that
of the digit recognizer discussed above; suffice it to say here that it was
far from that needed for a phonetic typewriter.

Subsequent to the development of these recognizers, Forgie and
Forgie constructed an automatic vowel recognizer that used characteristic
patterns of the second formant as a cue to identification.

Various disadvantages of these machines suggested two needs --
equipment for obtaining more precise information about the acoustic
correlates of speech, and an orderly method of grouping phones for
easier recognition.

At present the operation of automatic formant trackers is not
sufficiently reliable to enable researchers to use for general recognition
a considerable amount of published information about vowel sounds and
formant transitions as a cue to consonants. The vowel recognizer of
Forgie and Forgie, for instance, relied on a special definition of formants
based on their levels of energy; such levels, while present in controlled
experiments, might differ under different conditions of articulation by
different speakers.

Additional work, moreover, is needed to specify the
characteristics of unvoiced portions of speech that identify consonant sounds. Among
recent investigators in this field are Haskins Laboratories, Fry and Denes
at the University of London, and Docent Fant at the Royal Institute of
Technology, Sweden. Their research has related the relative distribution
of energy in preselected frequency bands to the consonant phonemes
producing the patterns. Results also showed that formant energy
distribution depended on the vowels that followed the consonants. Fry and
Denes reported it was possible for a machine to recognize certain
consonants by the above methods with an accuracy of 15% to 90%, but the
majority of their scores were closer to the lower limit.

Once researchers had achieved a limited success in identifying
speech by the gross characteristics of the acoustic patterns, moreover,
the value of more precise grouping of phones became apparent -- for
recognition as well as transmission of speech.

Acoustic recognition of vowels and acoustic recognition of
consonants had been shown to depend on different criteria; past research
with discrete speech had related vowels with steady state formant levels
and consonants with noise bursts, stop gaps, and transitions between
formants. Experiments with early recognizers, however, had also
shown that phones could not be identified by their gross acoustic patterns
alone. Exact identification seemed more likely to depend on an ordered
analysis based on similarities and differences between various phone
classes. Such an approach would provide one method for organizing
undifferentiated masses of acoustic data into a form comprehensible for
available electronic machinery. This approach would, moreover,
provide a feasible method of correlating the phonetic and acoustic differences
likely to occur between discrete and carefully articulated speech.

Grouping speech into categories based on their phonetic or
phonemic characteristics has, accordingly, been the subject of much
scientific interest. The particular value of such grouping is that it would
substantially reduce the number of choices which a machine might need to
make about sounds it perceived. Suppose that a transcribing machine
hears the phone d in the word die; it must distinguish between this word
and other sound combinations which may be possible. There are
approximately forty phone groups in the English language, and we assume that
each phone modifies the pronunciation of adjacent phones. With this set
of conditions the machine must theoretically choose between a number of
sound combinations in the range of forty factorial. This is, of course,
beyond the range of present computers.

Classification of sounds, however, reduces the process of choice
to a series of separate decisions between whole categories of sound
combinations; the machine decides whether a sound is voiced or unvoiced,
whether it employs the nasal passages to accent resonance, whether it is
articulated at the lips, alveolar ridge and teeth, or back of the mouth, and
continues to make such decisions reducing the possible characteristics of
the sound until it is possible to make one final decision identifying the
phone. This process may reduce the number of possible decisions involved
from a potential forty factorial to a number well within the capacities of
modern computing machines. Although scientists have devoted
considerable attention to such efficient classification, no general agreement exists
about the number of acoustic characteristics necessary to identify a
phone class for automatic recognition by machines.
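
The contrast between the two approaches can be made concrete with a
small sketch. The feature table below is a hypothetical one covering
only eight consonants, and the feature names and function names are
assumptions made for illustration; the text itself does not fix the set
of features. Each decision removes whole categories of candidates, and a
set of forty classes needs at least six two-way decisions, while forty
factorial is a number of forty-eight digits.

```python
import math
from typing import Dict, List

# Hypothetical feature table for a few consonants (illustrative only).
PHONES: Dict[str, Dict[str, object]] = {
    "p": {"voiced": False, "nasal": False, "place": "lips"},
    "b": {"voiced": True, "nasal": False, "place": "lips"},
    "m": {"voiced": True, "nasal": True, "place": "lips"},
    "t": {"voiced": False, "nasal": False, "place": "alveolar"},
    "d": {"voiced": True, "nasal": False, "place": "alveolar"},
    "n": {"voiced": True, "nasal": True, "place": "alveolar"},
    "k": {"voiced": False, "nasal": False, "place": "back"},
    "g": {"voiced": True, "nasal": False, "place": "back"},
}


def narrow(candidates: List[str], feature: str, value: object) -> List[str]:
    """Keep only the candidates whose `feature` has the observed `value`."""
    return [p for p in candidates if PHONES[p][feature] == value]


def identify(voiced: bool, nasal: bool, place: str) -> List[str]:
    """Apply the decisions one after another, shrinking the candidate set."""
    candidates = list(PHONES)
    for feature, value in (("voiced", voiced), ("nasal", nasal), ("place", place)):
        candidates = narrow(candidates, feature, value)
    return candidates


def min_binary_decisions(n_classes: int) -> int:
    """Fewest two-way decisions that can separate `n_classes` classes."""
    return (n_classes - 1).bit_length()


if __name__ == "__main__":
    print(identify(voiced=True, nasal=False, place="alveolar"))
    print(min_binary_decisions(40), "decisions versus", math.factorial(40))
```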

A phonetic report on grouping titled "Preliminaries to Speech
Analysis" was co-authored by Professors Halle, Jakobson and Fant.
In this report they present twelve characteristics of phonemes, such as
vocal cord resonance, nasal resonance, frictional noise and so on, which
are either present or absent in any selected phoneme.

The first full scale effort for development of equipment for
automatically grouping different sounds, however, seems to be that of
Professor Chang of Northeastern University. He tried to group sounds
as vowels, semi-vowels, nasals, non-vocal plosives, vocal plosives,
non-vocal fricatives with low or high energies, and vocal fricatives with
low or high energies. This separation of groups was based primarily
on spectral density distributions of phonemes in each group, on presence
or absence of energy concentrations in the low frequency range, on
overall energy for any phoneme, and on existence or absence of a silence
before the start of certain sounds.

Although several parts of Professor Chang's system were never
built, his effort can be considered a partial success since he was able to
separate certain groups of phonemes with over 90% reliability, whereas
other groups could not exceed 70% reliability of separation.

Following reports of this work, Dursch of IBM built a digit
recognizer in which he grouped phonemes of spoken digits essentially according
to Chang's system. As a major innovation, however, Dursch used
information in time-amplitude waveforms of speech as well as studies of speech
processed through band-pass filters. The performance of this recent
digit recognizer is reported to be 99% to 97% reliable, comparing
favorably with the 65% reliability reported for AUDREY. It is probable
that further developments along this line of investigation will be made.




APPENDIX  L 


Relation of Available Information to Acoustic Correlates of Speech
Necessary to the Development of our Model

1.  Manner of Articulation

Data from Visible Speech, Truby, and Haskins Laboratories in
the form of spectrograms confirm the following descriptions of manner
of articulation:

a.  Stops are denoted by stop gaps followed by energy bursts.

b.  Spirants  are  characterized  by  continuous  fricative  energy. 

c.  Nasals  are  characterized  by  a  special  low-frequency  nasal 
formant. 

d.  Sibilants  are  indicated  by  high  frequency  energy. 

e.  Affricates  are  denoted  by  a  stop  gap  followed  by  high  frequency 
energy. 

The  acoustic  characteristics  of  laterals  require  the  generation  of 
additional  acoustic  data. 
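
The five descriptions above can be read as a set of decision rules. The
sketch below is one possible reading, offered only as an illustration:
the cue names, the order in which the rules are tested, and the function
name are assumptions, and the cues are taken as already detected rather
than measured from a spectrogram. Laterals are left out, since the text
notes that their acoustic characteristics still require data.

```python
from typing import Optional


def manner_of_articulation(
    stop_gap: bool,
    energy_burst: bool,
    fricative_energy: bool,
    nasal_formant: bool,
    high_frequency_energy: bool,
) -> Optional[str]:
    """Map detected spectrographic cues to a manner class, or None if unclear."""
    if stop_gap:
        # A stop gap precedes both stops and affricates; what follows decides.
        if high_frequency_energy:
            return "affricate"
        if energy_burst:
            return "stop"
        return None
    if nasal_formant:
        return "nasal"
    if high_frequency_energy:
        return "sibilant"
    if fricative_energy:
        return "spirant"
    return None
```

Testing the stop gap first reflects the fact that two of the five
classes share it; a different ordering of the remaining rules would be
equally defensible.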

2.  Place of Articulation

The primary acoustic evidence reflecting place of articulation
would be the frequency levels of formant transition. Generation of
further data about this dimension of our model requires that we obtain
spectrograms of consonants such as m, b, and p with different manners
of articulation but the same place, then measure the relative frequency
of their onglides.

Data from Haskins with synthesized speech, it may be noted,
indicates that the formant transitions following the consonant [m] start
at the same frequency level as those for the consonant [b] although they
have different manners of articulation. Such evidence, if also true of
normal speech, would indicate that transitions are cues for place of
articulation.

3.  Voiced Nasal Resonances

Available information indicates that voiced nasal resonances such
as m, n, and ŋ are characterized by a voice bar plus nasal formants.

4.  Aspiration

The aspirated stops, e.g. [p], are characterized acoustically
by a stop burst followed by a period of friction.

5.  Non-distinctive Consonant Differences Depending on Following Vowels

Evidence of different starting points for second formant onglides,
particularly in the recent work of Bjorn Lindblom (Lindblom, 1963),
shows that the articulation of a consonant varies non-distinctively depending
on the following phone.

6.  Labialization of a Consonant Before a Rounded Vowel

Spectrograms of [s] before [w] show energy concentrated in the
unvoiced portion at the level of [w]'s second formant, suggesting
labialization of [s] before the labial [w].

7.  Duration

We have measured the duration of vowels as suggested in the
first section; such research suggests the importance of duration in
identifying vowels, but further data is needed. We have no further
data on duration differences caused by regional dialects, and this subject
also requires investigation.

8.  Intensity

Available information is in the form of broad-band spectrograms,
which show only major variations in intensity. Further data may be
available by measuring speech with time-amplitude plots.

9.  Frequency

Broad-band spectrograms also do not indicate exact formant
frequency; thus available data does not yield specific information about
frequency. Necessary data requires the use of narrow-band spectrograms
and cross sections, and of time-amplitude plots for further research.

10.  Consonant Clusters

Truby indicates that consonant clusters are co-articulated
and that the articulation of the initial consonant may not
affect the articulation of the following vowel. Further research in this
area with time-amplitude plots and spectral analysis seems indicated.

11.  Nasalized Liquids and Semi-Vowels

Available data yields no information about the acoustic
characteristics or existence of nasalized liquids and semi-vowels.

12.  Classification of the y Phone

It is worth noting that although we mention on page Hi that
"three years" and "three ears" might be treated as homonyms,
analyses of time-amplitude plots for Speaker ^ (of our own data) (see
Figures 73 and 74) indicate that y can be detected. In "three years"
(Figure 74) there is no change in the frequency pattern, but a y is
indicated by a decrease in the intensity of a portion of the waveform.
"Three ears" (Figure 73) gives no such indication; there is, in other
words, no detectable y.




APPENDIX  M 


MEASUREMENTS  MADE  FOR  DETERMINING 
ACOUSTIC CHARACTERISTICS OF SPEECH

We performed measurements on spectrograms to determine
the frequency of occurrence of various parts of the speech wave such
as the second formant onglide during the articulation of consonant-vowel
combinations. The extreme variations in the spectrum caused by the
articulation of one consonant with different vowels suggest the acoustic
interdependence of consonant and vowel articulation as indicated by their
organization in our model through placement on different planes.

Data was taken exclusively from Visible Speech; to obtain it
the following method was used: measurements of illustrations of
spectrograms were made with a ruler whose smallest division was 1/16".
These measurements were then tabulated with a scale given on page l<i
of Visible Speech. Height of the illustrations was approximately 13/16
of an inch and by the given scale 1/16" represented ZM cycles or ii $
milliseconds. Thus our measurements of frequency and duration are
representative rather than definitive; there is considerable margin
for error. In certain cases, it was not possible to measure every
aspect of the onglide, steady-state and offglide.

We also measured some data presented by Truby indicating the
effect of final consonants on vowel frequency and duration; although
minimal pairs would have been desirable for compilation of this data,
they were not available. In these measurements we used a ruler
whose smallest unit was one-eightieth of an inch. Illustrations from
Truby, however, are smaller than those from Visible Speech; according
to his scale one-eightieth of an inch v on o  .  (^  ja)  J  0  ^  cycles and one
sixteenth of an inch v.  i  m  hi  et  j»i  a!  'IS  •  >  •..  !  <  ,


APPENDIX  N 


Finally, we performed measurements of the proportionate
relationships between onglide, steady-state, and off-glide for four
words taken from the work of Lehiste and Peterson. The investigators
include no scale for their illustration, but the relative durations of 1^1,
and [_ij form one important criterion for distinguishing between these
sounds; thus the data from Lehiste and Peterson shows that duration is
an element important to the evaluation of acoustic data by a general
speech transcriber.


intrinsically  associated  with  it  ;; 
(Visible  Speech,  p.  30),  it  appea 
a  spectrogram  simply  as  a  stop 
lout  a  voice  bar  followed  by  a  sto 


APPENDIX 


DISCUSSION OF THE RULES OF EUPHONIC COMBINATION
SUPPORTED BY PHONETIC TRANSCRIPTION

A.  Texts  Used 

The passages used in our analysis of continuous speech are taken
from two sources as mentioned in the main body of our text: a record by
Jackie Gleason (Decca ZlbS-i) and tapes of natural conversation. One
side of the Jackie Gleason record is entitled "What Is a Boy?" and
the other side "What Is a Girl?" The tape recordings of conversation
were made by Dr. J. M. Pickett of the Air Force Cambridge Research
Laboratories; Dr. Pickett kindly let us make copies of them.

Each side of the Jackie Gleason record has been divided into
four parts and a number assigned to each part. Every time we cite a
passage from this record, we give the number of the part in which it
appears.

On the side called "What Is a Boy?", Part I begins with "Between"
and ends with "and Heaven." Part II begins with "protects them" and
ends with "Paul Bunyan, the". Part III begins with "shyness of" and
ends with "nobody else can". Part IV begins with "cram into" and ends
with "Dad."

On the side called "What Is a Girl?", Part I begins with "Little"
and ends with "special look". Part II begins with "in her eyes" and
ends with "softness of a". Part III begins with "kitten" and ends with
"flirtatious". Part IV begins with "when she" and ends with "of all".

The passages from the AFCRL tapes which contain our examples
are quoted below; before each passage, we give a brief description of its
context. Each passage has been assigned a number.

Passage I

Conversation about anechoic chamber with girl who said she
was majoring in the psychology of education.

Female voice: 's fascinating. It looks like an attic.

Male voice: Some people come in, first remark they make is, "My
ears seem to feel funny."

Female voice: My ears didn't feel funny but um uh speech sounds a little
bit different, sort of muffled.

Male voice: Yes, if you uh clap (clapping sound).

Female voice: Yeah.

Male voice: Sort of . . .

Female voice: Yeah, 'ts funny.

Passage II

Conversation about the word list with the girl who said she was
majoring in government.

Male voice: We use those in a way to calibrate our speech system, since
we right now can't put a little thi . . . something like a voltmeter
on, we . . . we have to test our system with speech itself.


Passage III

Conversation  about  regional  accents  with  girl  majoring  in 
government. 

Male  voice:  Well,  don't  you  sometimes  stick  r's  in  when.  .  . 

Female  voice:  Once  in  a  while. 

Passage  IV 

Conversation  about  courses  required  for  major  in  government. 
Male  voice:  Is  this  partly  city  planning?  I.  .  .  I  don't  really  know.  .  . 

Female voice: No.

Male voice: Theory of government?

Female voice: Uh, well, this year it's kind of a general. . .

Euphonic Combination Substantiated or Suggested by
Phonetic Transcription

Numbers in parentheses after rules cited in this division refer
to the appendix where the rule may be found. In examples cited
below these rules, the bracketed letters represent phonetic
symbols according to the modified standard International Phonetic
Association Alphabet (in accordance with the consonant categories
listed in Appendix !»). The phonetic statement is followed by
an example drawn from our transcription. The notations
"Boy I", "Girl II", or "AFCRL tapes III" refer respectively to
specific texts of "What is a Boy?", "What is a Girl?" or the recordings
of  conversation  from  Hanacom  Air  Force  Base.  Thus  the  notation,  [pj 
becomes  Jn  j  between  the  (Boy  I),  means  that  an  alveolar  n  becomes 
a  palatal  n  as  exemplified  in  our  transcription  of  the  phrase  between' 
the  contained  in  the  first  textu.al  segment  analyzed  from  "MThat  is  a  Boy?" 

1. Change in Place of Articulation:

(a) Before or after a dental, an alveolar may become a dental.

[n] becomes [n̪]    between the (Boy I)

[n] becomes [n̪]    with noise (Boy I)

(b) Before or after a palatal consonant, a dental or alveolar nasal
may become a palatal (Appendix H, I).

[n] becomes [ɲ]    enjoy (Boy I)

[n] becomes [ɲ]    when you come home (Boy I). It is interesting
to note that the phrase when you are busy has
[n] rather than [ɲ], although the environment
is the same.

2. Change in Resonances:

(a) A voiceless consonant may become voiced next to a voiced
consonant or between vowels (Appendix H, III).

[k] becomes [g]    comic books    (Boy III) The final consonant of
                                  comic is [g] rather than [k].
[s] becomes [z]    across the     (Boy III)
[t] becomes [d]    top it all     (Girl III) The final consonant
                                  of it is [d].
[t] becomes [d]    fascinating    (AFCRL tapes I)
[t] becomes [d]    attic          (AFCRL tapes I)
[t] becomes [d]    little         (AFCRL tapes I)
[s] becomes [z]    once           (AFCRL tapes III)

(b) A voiced consonant may become voiceless next to a voiceless
consonant (Appendix H, III).

[b] becomes [p]    absolutely     (Girl IV)
[v] becomes [f]    of string      (Boy III)
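The two voicing rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: phones are written as single ASCII letters (a hypothetical encoding, not the report's modified IPA), the voiced/voiceless pairs come from general phonetics rather than from the appendix, and each rule is applied wherever its environment is met, although the report says only that the change "may" occur. Rule 2(a) is limited here to the between-vowels case and rule 2(b) to a voiced consonant standing before a voiceless one, which is the environment in both of the report's examples.

```python
# Hypothetical single-letter phone labels; the pairs below are ordinary
# voiceless/voiced counterparts, not a table taken from the report.
VOICED_OF = {"p": "b", "t": "d", "k": "g", "s": "z", "f": "v"}
VOICELESS_OF = {voiced: voiceless for voiceless, voiced in VOICED_OF.items()}
VOWELS = set("aeiou")


def voice_between_vowels(phones):
    """Rule 2(a), intervocalic case: a voiceless consonant becomes voiced
    when the phones on both sides of it are vowels."""
    out = list(phones)
    for i in range(1, len(phones) - 1):
        if (phones[i] in VOICED_OF
                and phones[i - 1] in VOWELS
                and phones[i + 1] in VOWELS):
            out[i] = VOICED_OF[phones[i]]
    return out


def devoice_before_voiceless(phones):
    """Rule 2(b), preceding case: a voiced consonant becomes voiceless
    when the next phone is a voiceless consonant."""
    out = list(phones)
    for i in range(len(phones) - 1):
        if phones[i] in VOICELESS_OF and phones[i + 1] in VOICED_OF:
            out[i] = VOICELESS_OF[phones[i]]
    return out
```

With this encoding, `voice_between_vowels(list("atik"))` turns the medial t of "attic" into d, and `devoice_before_voiceless(list("ovstring"))` turns the v of "of string" into f, in line with the examples listed under 2(a) and 2(b).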

3. Sound Drop-outs

(a) An alveolar stop between two consonants may drop out
(See Appendix H.)

after [n]
    and the            [acn 3i]         (Boy I)
    and colors         [aen kxlaz]      (Boy I)
    didn't feel        [d»dn fil]       (AFCRL tapes I)
    sounds a           [savn zi]        (AFCRL tapes I)

after [s]
    best clothes       [bt8 klovz]      (Girl I)
    must not           [mas nat]        (Girl IV)
    first grade        [fji s grexd]    (Girl III)

after another stop
    protects           [pro ttks]       (Boy II)

after a spirant
    softness           [ssX nxb]        (Girl II)

(b) An alveolar stop before an affricate may drop out.

before
    straight chairs    [strei ctrz]     (Girl III)

after [n]
    in her             [» nsr]          (Girl II)

after [s]
    grasshopper        [Braes sa pa]    (Girl II)

after [k]
    lock him           [la kim]         (Boy IV)
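Rule 3(a) can be illustrated with a short Python sketch. It assumes a hypothetical single-letter phone encoding in which every phone that is not one of the five vowel letters counts as a consonant, and it deletes every alveolar stop that stands between two consonants. The report states only that the stop "may" drop out, so a function like this over-applies the rule (it would also remove the t inside the initial cluster of "string"); it shows the environment, not a claim about when speakers actually drop the stop.

```python
# Hypothetical single-letter phone labels; anything that is not one of
# these vowel letters is treated as a consonant in this sketch.
VOWELS = set("aeiou")
ALVEOLAR_STOPS = {"t", "d"}


def drop_alveolar_stop(phones):
    """Rule 3(a): remove an alveolar stop whose neighbours on both sides
    (in the original sequence) are consonants."""
    out = []
    for i, phone in enumerate(phones):
        prev_is_consonant = i > 0 and phones[i - 1] not in VOWELS
        next_is_consonant = i + 1 < len(phones) and phones[i + 1] not in VOWELS
        if phone in ALVEOLAR_STOPS and prev_is_consonant and next_is_consonant:
            continue  # the stop drops out
        out.append(phone)
    return out
```

For example, `drop_alveolar_stop(list("mustnot"))` removes the t after s in "must not" and leaves the final t alone, because the final t has no following consonant.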

4. Shortening of long consonants

(a) When two identical consonants come together they form one long
consonant which may be shortened to the normal length of a
single consonant. This rule also affects long consonants produced
by other rules of euphonic combination. (Appendix H, IX)

    special look           [spt/a lvk]                 (Girl I)
    makes something        [transcription illegible]   (Boy)
    gets so                [gtt so]                    (Boy III)
    gum drops six cents    [transcription illegible]   (Boy IV)
    supersonic code        [transcription illegible]

(b) Combinations of identical consonants which are the result of other rules:

    next door          [n£k. slJr]          [d] becomes [t] and the long [t]
                                            is shortened. (Girl III)
    knives saws        [naxv sjz]           [z] becomes [s] and the long [s]
                                            is shortened. (Boy III)
    trains Saturday    [treln sae ta dei]   [z] becomes [s] and the long [s]
                                            is shortened. (Boy III)
    [illegible] so     [% so]               [z] becomes [s] and the long [s]
                                            is shortened. (Boy III)
    bedtime            [be taim]            [d] becomes [t] and the long [t]
                                            is shortened. (Boy III)
    ears seem          [ir sim]             [z] becomes [s] and the long [s]
                                            is shortened. (AFCRL tapes I)
    bit different      [bi di frytnt]       [t] becomes [d] and the long [d]
                                            is shortened. (AFCRL tapes I)
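The shortening rule amounts to collapsing a run of identical consonants into a single one. The sketch below is a minimal illustration, assuming a hypothetical single-letter phone encoding and assuming that the five vowel letters are the only vowels; it always shortens, whereas the report says the long consonant "may" be shortened. The input is taken to be the sequence after any other rule (such as the devoicing of [d] before [t] in "bedtime") has already been applied.

```python
def shorten_long_consonants(phones, vowels=frozenset("aeiou")):
    """Rule 4: collapse adjacent identical consonants into one.
    Repeated vowels are left untouched."""
    out = []
    for phone in phones:
        if out and out[-1] == phone and phone not in vowels:
            continue  # second half of a long consonant: skip it
        out.append(phone)
    return out
```

Applied to `list("getsso")` (for "gets so") the function returns the phones of "getso", and applied to `list("bettaim")` (for "bedtime" after devoicing) it returns the phones of "betaim", matching the single [s] and single [t] shown in the report's transcriptions.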

5. Euphonic Combination of Final Consonants with Following Phones:

The final consonant of one word may attach itself to the initial phone
of the following word. (Appendix H, IX)

We have so many examples for this type of sound combination that
we have not listed them all. Examples of such combination from careful
speech cited below are found in part (Boy I) of the Jackie Gleason record.
For the AFCRL tapes of conversational speech we list all examples found
in passage I.

Jackie Gleason record:

    innocence of                     [iinosxn ssv]
    babyhood and                     [beibihv daen]
    find a                           [fa* n d e]
    come in assorted                 [A mx nssoatid]
    weights and                      [wcit S9n]
    second of every                  [sfken da vfvrj]
    of every                         [p v^vri]
    hour of every                    [av ra vtvri]
    their only                       [th£ rovnly]
    them off                         [Un m:f]
    boys are                         [transcription illegible]
    found everywhere                 [avn dtvriwir]
    top of underneath inside of      [ta VAnd%ni l^xnsax dAv]
    climbing on                      [kiaimx :jan]
    running around or                ['Ani jjaraun do]
    mothers love                     [mA?s zlAv]
    sisters and                      [pistt zan]
    adults ignore                    [?ae dAl tSIgnoiT]

AFCRL tapes:

    's fascinating                   [sfac si nci di]
    looks like an attic              [vk slat ka nac d]
    come in                          [h® inin]
    first remark                     [fars tri mark]
    make is                          [jiiei KKz]
    ears didn't                      [Qa zclr dn]
    sounds a                         [pacjn]
    if you                           [l fyu]
    sort of                          [sor to]
    'ts funny                        [t.sfA ni]
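The attachment of a final consonant to the following word can be sketched as a simple regrouping of phones across word boundaries. The Python below is an illustration only: words are given as strings of hypothetical single-letter phones, the five vowel letters are assumed to be the only vowels, and every non-final word that ends in a consonant hands that consonant to the next word, although the report says only that the consonant "may" attach itself.

```python
VOWELS = set("aeiou")  # hypothetical single-letter vowel labels


def attach_final_consonants(words):
    """Rule 5: move the final consonant of each word (except the last)
    to the front of the following word."""
    out = []
    carried = ""  # consonant handed over from the previous word
    for i, word in enumerate(words):
        body, given = word, ""
        is_last = i + 1 == len(words)
        if not is_last and word[-1] not in VOWELS:
            body, given = word[:-1], word[-1]
        out.append(carried + body)
        carried = given
    return out
```

Given `["luks", "laik", "an", "atik"]` for "looks like an attic", the function returns `["luk", "slai", "ka", "natik"]`, the same grouping that the report's transcription of this phrase shows.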

6. Glottal stops

(a) A glottal stop may be introduced before a word starting with a
vowel (Appendix H).

    adults    (Boy I) There is a glottal stop before the initial
              vowel of adults.

(b) A glottal stop may be substituted for [t] before a labial consonant.

    [tm] becomes [ʔm]    voltmeter    (AFCRL tapes II) The consonant
                                      before the m is a glottal stop
                                      rather than a [t].

7. Consonant clusters involving [y]: The combination of [s] followed by [y]
may become [ʃ]. (Appendix H, II)

    this year    [ai/ir]    (AFCRL tapes IV)


8. Treatment of [r] in New England Dialects

(a) An [r] after a vowel may drop except when [r] stands at the end
of a word and the next word begins with a vowel. (Appendix H)

(1) Loss of [r] between a vowel and a consonant:

    their last       [transcription illegible]    (Boy I)
    fire cracker     [faxa kraeka]                (Boy III)

(2) Retention of [r] between two vowels:

    their only       [thl roun li]                (Boy I)
    fire engines     [fal r£n y^nz]               (Boy III)

(b) In some New England dialects [r] may be inserted between a word
ending with a vowel and a word beginning with a vowel. (Appendix H, IX)

    law of           [is rAv]                     (Girl II)
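Both parts of the [r] treatment can be sketched over a list of words. The Python below is a hedged illustration: words are non-empty strings of hypothetical single-letter phones, the five vowel letters are assumed to be the only vowels, and both changes are applied wherever their environment is met, although the report describes them as optional. The choice to drop a postvocalic [r] at the very end of the utterance is an assumption of this sketch; the report does not give an example of that case.

```python
VOWELS = set("aeiou")  # hypothetical single-letter vowel labels


def drop_postvocalic_r(words):
    """Rule 8(a): drop an r that follows a vowel unless the next phone,
    in the same word or at the start of the next word, is a vowel."""
    out = []
    for w_index, word in enumerate(words):
        kept = []
        for i, phone in enumerate(word):
            if phone == "r" and i > 0 and word[i - 1] in VOWELS:
                if i + 1 < len(word):
                    following = word[i + 1]
                elif w_index + 1 < len(words):
                    following = words[w_index + 1][0]
                else:
                    following = None  # end of utterance (assumed: r drops)
                if following is None or following not in VOWELS:
                    continue
            kept.append(phone)
        out.append("".join(kept))
    return out


def insert_r_between_vowels(words):
    """Rule 8(b): insert r at the start of a vowel-initial word when the
    preceding word ends in a vowel."""
    out = list(words)
    for i in range(1, len(out)):
        if out[i - 1][-1] in VOWELS and out[i][0] in VOWELS:
            out[i] = "r" + out[i]
    return out
```

`drop_postvocalic_r(["ther", "last"])` drops the r of "their" before a consonant, while `drop_postvocalic_r(["ther", "onli"])` keeps it before a vowel; `insert_r_between_vowels(["lo", "ov"])` yields `["lo", "rov"]`, corresponding to the inserted [r] in "law of".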


APPENDIX  Q 


Discussion  of  w  phone  class: 

Researchers at Haskins Laboratories have reported that an initial
onglide of at least 50 milliseconds duration that begins at or close to the
[w] locus will be perceived as a [w] (Liberman, Delattre, Gerstman,
and Cooper, 1956). Observations like this are important to the successful
construction of stylized formant patterns. But we object to the assumption
that such stylized patterns, produced through mechanical methods, can
be used to provide information about the acoustic characteristics of
actual speech. After a review of the data we recently generated, however,
we are able to make a conclusion about the function of onglide duration in
the determination of w: for, the duration of the onglide was not found to
be consistently greater for the phrase with w than for the phrase without it.
For Speakers 1 and 5 "no ax" has the shorter onglide if we do not include
the duration of the pause, but for Speaker 2 "no wax" has the shorter onglide
(see the Figures). So the initial onglides of Speakers 1 and 5 tend to
validate the Pattern Playback constructions of Haskins Laboratories; but
the example of Speaker 2 makes it very difficult for us to consider initial
onglide duration measurement a solution for the identification of w as a rule.
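The argument above reduces to a consistency check: if onglide duration identified w as a rule, the phrase with w would have the longer onglide for every speaker. The Python sketch below states that check. The durations in `EXAMPLE_MS` are invented placeholders chosen only to reproduce the pattern described in the text (Speakers 1 and 5 agree with the rule, Speaker 2 does not); they are not measurements from this report, and the function names are hypothetical.

```python
# Placeholder onglide durations in milliseconds, given per speaker as
# (phrase with w, phrase without w). These values are invented for
# illustration and are NOT taken from the report's figures.
EXAMPLE_MS = {
    "Speaker 1": (80.0, 60.0),
    "Speaker 2": (50.0, 70.0),
    "Speaker 5": (75.0, 55.0),
}


def contradicting_speakers(measurements):
    """Return the speakers whose onglide for the phrase with w is not
    longer than their onglide for the phrase without w."""
    return sorted(
        name
        for name, (with_w, without_w) in measurements.items()
        if with_w <= without_w
    )


def duration_rule_holds(measurements):
    """True only if every speaker has the longer onglide in the phrase
    with w, i.e. duration could serve as a rule for identifying w."""
    return not contradicting_speakers(measurements)
```

With the placeholder data, `contradicting_speakers(EXAMPLE_MS)` returns `["Speaker 2"]` and `duration_rule_holds(EXAMPLE_MS)` is `False`: a single contradicting speaker is enough to rule out duration as a general criterion.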

Haskins fails to define the characteristics of speech by evaluating
the judgement of listeners listening to the speech patterns Haskins produces
on the Pattern Playback. But the perception of speech patterns does not
necessarily provide us with information about the realities of speech
production: our onglide duration analysis of the phone indicates this.

I'll'  at.l.t  iMpI  I  • »  «  oiup  reheli.l  I  h«  .i  •  ei  I  » i  >  o  1'  i' t  1 ,  1 1  c  ( ;; )  of  W  IS  i’ll  I' t  he  I* 

I  ONI  pi  it  all  d  bv  tl-'  Lot  that  \v  r  hase  is  ri-fvl  1  an  I'.s.i  in  pi  c  ol  initial  [  w  ’ 
vsiilioiit  ans  app.iriiii  slo. id-.  do:  ihis  i u.  o*  v.oj'il  "will"  nl.tered  liy 
.S|  ii  .  1 1;  (■  r  ‘i  \  tie  .i-et  -  n.  <•  ''Will  %  i  -u  hi  Ip  i "  [Si  i  l-'i ;;  u  r*  ■  11).  1 1  a  ppea  r  s 

f  h  ( 1  '*  wj  i  .  lo.i  ;-l  .  d  o'.  -I  h. »!:  •  :  =  ■  •  ;  I f  a'  <  ..n  i  i  •)»  .  U’oiiU!  o  at  s  illlfl  a  I  oil  r. 

'  'll  '  I  id<'  i  e  •  >1  lu  i-  I  n  rl  he  r  ro  •  «  r«  h  i :  u  - •(  *1.  <j  i n  di  t .  rm  i  no  i  n  whii  )i 
'  il  [ei;i  a  1  .iij  -••L.ii-K  Old  ui  a'  ■  .1  ha,!',  oi.  mi'Ii  .  1  cfore  our  \vorl< 

1 ! Is  I  IMUM  I  loD.i  I  I :•  e .  1*1  > i.d  ioi,  p«  rt . noui  .  1  . •  , .  ;  o  .i '  ■  ,  i m  «  one  li  -  si  \  i  . 

<  .  coi-i  I  ur<il  I  s  d  C-.  p.  .-.M  III  I  i  }.,d  al  1  '  ij"  • p  .1  r«  :  « I  by  long 

I'll  'lidos  v'.liib  .ill  I.  .11  <iiiii.il  ‘  '.si'};  a.*v’  ii.n'ci d  by  ba.i,  ..li  .ol  -stall  s. 


ACKNOWLEDGEMENTS 


We gratefully acknowledge the interest of Mr. Caldwell Smith
and  his  recognition  of  the  importance  of  the  present  research;  thanks 
are  also  due  to  the  Air  Force  Cambridge  Research  Laboratories 
for  initiating  this  project. 

We  further  acknowledge  the  help,  information  and  guidance 
given  by  Mr.  Smith,  through  discussion  of  other  work  on  speech 
processing  and  its  relation  to  our  model  concepts.  He  also  made 
it  possible  for  us  to  process  our  speech  data  at  AFCRL. 

We also acknowledge the assistance of Mr. Weiant Wathen-Dunn,
Dr. J. M. Pickett, and Mr. Philip Liberman of the AFCRL Speech
Research Laboratory, who provided us with tapes of normal
conversational speech and helpful comments pertaining to the description
of normal conversational speech, as distinguished from the
enunciation of isolated words.

The time-amplitude plots of speech were made at the Minneapolis-
Honeywell facilities in Boston by arrangement with Mr. McCarty and
Mr. Mara. Their cooperation and help is gratefully recognized.

We also appreciate the cooperation of the Harvard Dramatic
Society, who provided speakers for the generation of our data.


The principal contributor to this effort was B. V. Bhimani;
the linguistic aspects of speech were studied by Mrs. M. F. D.
Degges. The mathematical formulations and symbolic representations
were performed and checked by Mr. Frank Rubin and Mr. Mark
Douma. The writing of the monthly reports was done by Mr. Gordon
Millie, Mr. Gerald Hillman, and Miss Margery Lowenberg. Mrs.
.ioelh .iarker also aided in the preparation of monthly reports;
in addition, she organized and edited the final report.


BIBLIOGRAPHY 


Bhandarkar, R. G., First Book of Sanskrit, 33rd ed., Bombay (1957).

Bloch, Bernard, "Phonemic Overlapping," American Speech, (1940), 16.

Bloomfield, Leonard, Language, New York (1933).

Bolinger, Dwight, "A Theory of Pitch Accent in English," Word,
(1958), 14.

Buck, Carl D., Comparative Grammar of Greek and Latin, Chicago (1933).

Classe, Andre, Rhythm of English Prose, Oxford (1939).

Cooper, F. S.; Delattre, P. C.; Liberman, A. M.; Borst, J. M.;
and Gerstman, L. J., "Some Experiments in the Perception
of Synthetic Speech Sounds," Journal of the Acoustical Society
of America, (1952), 24.

Delattre, P. C.; Liberman, A. M.; and Cooper, F. S., "Acoustic
Loci and Transitional Cues for Consonants," Journal of the
Acoustical Society of America, 27:4 (1955).

Denes, P., "Effect of Duration on the Perception of Voicing," Journal
of the Acoustical Society of America, (1955), 27.

von Essen, Otto, Allgemeine und Angewandte Phonetik, Berlin (1953).

Fant, Gunnar, Acoustic Theory of Speech Production, 's-Gravenhage,
Mouton & Co., (1960).

Fant, Gunnar, and Lindblom, Bjorn, "Studies of Minimal Speech Sound
Units," Speech Transmission Laboratory, Quarterly Progress and
Status Report, 2/1961, Royal Institute of Technology, Stockholm,
pp. 1 - 11.

"First Grammatical Treatise," edited and translated by Einar Haugen,
Language, Supplement to Vol. 26:4, Oct. - Dec., 1950.

Fischer-Jørgensen, Eli, "Acoustic Analysis of Stop Consonants,"
Miscellanea Phonetica, (1954), 2.

Fourquet, J., Les mutations consonantiques du germanique, Paris (1948).

Fry, Dennis, "Experiments in the Perception of Stress," Language
and Speech, No. 2, (1958), 1.

Harrell, Richard, "Some English Nasal Articulations," Language,
(1958), 34.




Harris, Katherine S., "Cues for the Discrimination of American
English Fricatives in Spoken Syllables," Language and Speech,
(1958), 1.

Heusler, Andreas, Altisländisches Elementarbuch, 4th edition,
Heidelberg (1950), (reprint of 3rd edition, 1931).

Jackson, Kenneth, Language and History in Early Britain, Edinburgh
(1953).

Lehiste, Ilse, An Acoustic-Phonetic Study of Internal Open Juncture,
supplement to Phonetica, (1960), 5.

Lehiste, Ilse, and Peterson, Gordon E., "Duration of Syllable Nuclei
in English," Studies in Syllable Nuclei 1, Speech Research
Laboratory, Ann Arbor, Michigan, (1960).

Lehiste, Ilse, and Peterson, Gordon E., "Transitions, Glides, and
Diphthongs," Studies in Syllable Nuclei 2, Speech Research
Laboratory, Ann Arbor, Michigan, (1960).

Lewis, Henry, and Pedersen, Holger, A Concise Comparative Celtic
Grammar, Göttingen (1937).

Liberman, A. M.; Delattre, P.; and Cooper, F. S., "The Role of
Selected Stimulus-Variables in the Perception of the Unvoiced
Stop Consonants," American Journal of Psychology, (1952), 65.

Liberman, A. M.; Delattre, P.; and Cooper, F. S., "Some Cues for
the Distinction between Voiced and Voiceless Stops in Initial
Position," Language and Speech, (1958), 1.

Liberman, A. M.; Delattre, P.; Gerstman, L. J.; and Cooper, F. S.,
"Tempo of Frequency Change as a Cue for Distinguishing
Classes of Speech Sounds," Journal of Experimental Psychology,
(1956).

Lindblom, Bjorn, On Vowel Reduction, Report No. 29, Speech Transmission
Laboratory, Royal Institute of Technology, Stockholm, (1963).

Lisker, Leigh, "Closure Duration and the Intervocalic Voiced-Voiceless
Distinction in English," Language, (1957), 33.

Martinet, André, "Function, Structure, and Sound Change," Word,
Vol. 8, 1952, 1.

Martinet, André, Économie des changements phonétiques, Berne (1955).

M  in i-.ii  1- .  j,.  ,  .4;^, I  I  r<la  .  A..  K« r  1 1  kn  l.i  1  n  in  .  Sli-uruMe,  und 
J_,;oT  la  bti  1’*  r./.uiii^  ,  Dnpliii  aiul  iiiani  (IS  i>). 


Meyer, Ernst A., Englische Lautdauer, Uppsala (1903).

Niedermann, Max, Précis de phonétique historique du latin, 3rd edition,
Paris (1953).

O'Connor, J. D.; Gerstman, L. J.; Liberman, A. M.; Delattre, P. C.;
and Cooper, F. S., "Acoustic Cues for the Perception of Initial
/w, r, j, l/ in English," Word, (1957), 13.

Peterson, Gordon E., and Barney, Harold L., "Control Methods Used
in a Study of the Vowels," Journal of the Acoustical Society of
America, Vol. 24, 1952, 175 - 184.

Sharf, Donald J., "Duration of Post-stress Intervocalic Stops and
Preceding Vowels," Language and Speech, (1962), 5.

Streitberg, W., Urgermanische Grammatik, Heidelberg (1895).

Thomas, Charles K., An Introduction to the Phonetics of American
English, New York (1947).

Thurneysen, Rudolf, translated by Binchy, D. A., and Bergin, Osborn,
A Grammar of Old Irish, Dublin (1947).

Whitney, William D., Sanskrit Grammar, Cambridge (1889).





